IoT Creators The Thing Migration 11/05/2021 [Status Update]

  • Deutsche Telekom IoT

    Hi all,

    we have successfully upgraded our system today, but we still need some time to stabilize the system.

    The portal is back up, but to be transparent: we are currently experiencing some instabilities in our platform when delivering device messages [11.05.2021, 11:45]. We are working at full speed to fix the problems.

    [UPDATE 12:55] We see that all components of the new release are running well, so we decided not to roll back the platform and to work on stabilizing it instead. We are currently reconfiguring the sizing of our infrastructure to resolve the problem as quickly as possible.

    [UPDATE 15:15] Unfortunately, we could not stabilize the system and we will roll back to the old version.

    We would like to apologize for any inconvenience caused. We will keep you updated on the progress.

    Thanks for your understanding.

    IoT Creators Team

    11:00 - Migration is done
    11:10 - The IoT Creators Portal is back online
    11:45 - We are troubleshooting some problems in message delivery
    12:55 - The instability is caused by the high load that we have on our production platform.
    13:00 - We are currently reconfiguring the sizing of our infrastructure
    15:15 - Decision: we roll back to the old version until the issues have been fixed
    15:20 - Rollback started
    15:35 - Rollback completed

  • I guess I’m not the only one who’s very unhappy with how all this went down.
    We’re receiving complaint after complaint.

    Hope this will mean you will never again do a major update during office hours.

    Not to say “I told you so” but I told you so.

  • Deutsche Telekom IoT

    Hi @magnatron, I am very sorry to disappoint you. But there is a chance that next time it will again happen during office hours.

    For many of our customers there is no difference between daytime device data and nighttime device data. They’re both equally important.

    Today’s problems lasted a little longer than 4 hours. The risk of doing this during the night is that it could take 8 or 10 hours.

    I don’t like the idea of losing data at all. But if I have to choose, then 4 hours is better than 8 hours.

  • It has certainly been a stressful day for all of us, I assume. For me, however, upgrading during office hours is still preferable, since this allows me to monitor the status and contact our customers when something goes wrong. It’s a pity the upgrade failed, but I’m glad you kept us informed during the process.

  • Deutsche Telekom IoT

    Hi @magnatron and everybody else who was affected:

    I’m really sorry that the process was not as smooth as we all have wished for. And if it caused any inconvenience, I apologize for this.

    We have planned and tested this for months, but you never know how a production system will behave, even when everything went smoothly in a testing environment.

    As Afzal said, we had our reasons to do this during “office hours”. We have discussed other options, but that seemed to be the most reasonable way for us to do it as we do have critical devices sending data 24h (so from that perspective our office hours are 24/7).

    Did everything go as planned today? No. But we will learn from this and hopefully we can make it better next time. We were super transparent as soon as we realized that there were some problems and updated our community as much as possible while troubleshooting the problems.

    Hopefully the next attempt will be smoother. Thanks for your understanding.

  • Deutsche Telekom IoT

    @Cees-Meijer Thanks for the motivating words!

  • Deutsche Telekom IoT

    I was in the conference call with all the technicians for the whole day. We first completed the migration successfully and then noticed that the adapters in the platform could not cope with the number of incoming messages. The best thing was that every member of several teams in different countries was online and reachable the whole time. It was real teamwork and a very good collaboration between infrastructure experts, software experts, network and cloud experts, testers and project management from our side. I agree with what Afzal said: the number of device messages is very evenly balanced throughout the whole day, even if that may be different for a specific project.
    The picture shows the traffic pattern from 5 to 8 May 2021.
    Apologies from my side as well. I’m also here to answer all your questions.

  • Ok. I understand your reasons for choosing office hours.

    But please do take to heart for next time not to send out conflicting messages:

    @afzal_m explained on 28/4 12:34 that office hours were the better choice
    but hours later @Roland-Baldin-as-Admin sent us an email in which he said you were finding another time outside office hours.
    Nothing after that.

    So we were expecting another announcement and we had no reason to notify our customers.

    Shock and awe when we started receiving complaints and received the update-email almost an hour after the end of the initial time frame stating you were trying to fix the resulting problems.

    So, I’m happy @Cees-Meijer, that you feel you were updated during the process but I definitely do not feel we were kept in the loop.

    And of course: “Shit happens!”

    But to describe the process as “not as smooth as we all have wished for” does not really seem accurate to me.

  • Deutsche Telekom IoT

    I agree the conflicting pre-migration communication to you specifically was not good at all and I’ve already apologized (see my email that I have sent to you).

    With regard to today’s communication, however, I disagree. We were super transparent. The migration itself went well and all components of the system were running. We switched the portal back on at 11:10 (yes, 10 mins. late). The problems only started to appear when the high load hit the system, and as soon as we noticed, we immediately started communicating (here at 11:45 and by email at 11:55).

    We posted several updates and tried to keep everyone in the loop as much as possible while focusing on fixing the issues. And I really appreciate the positive messages that I have received about the transparent communication, even though not everything went as planned.

  • Does this mean that we were the only ones who received the “we’re rescheduling the upgrade” email from @Roland-Baldin-as-Admin?

    So then I am the infamous one-star-rating-customer…

  • Deutsche Telekom IoT

    CLOSED: Migration cancelled. New thread will be opened for new migration.