Thursday 10 January 2013

Avoiding Downtime For Azure Web/Worker Role Instances when the Azure Infrastructure Upgrades the OS

I discovered a problem the other day where a typical web application hosted on 2 Azure web role instances experienced downtime for approximately 6 minutes. I was initially alerted to the problem by Pingdom, which informed me that every single page across my web application had gone offline. The screenshot below is for one of those pages and it shows 2 downtime events; the first of these was the actual problem, while the second was something else:


Pingdom's root cause analysis of the downtime simply indicated a "Timeout (> 30s)" from multiple locations around the globe. Given that every page monitored by Pingdom showed the same problem, I was pretty sure the entire site had gone down. I quickly logged into the Azure Management Portal during the downtime event to check the status of my web role instances and noticed that one of them (the second of the two) was currently rebooting. I immediately suspected that an Azure OS update had been initiated and that I was watching the second of my web role instances being rebooted after the update. Note that Azure web/worker roles are a PaaS offering, so Azure automatically handles OS and infrastructure upgrades. This makes it easier to focus on your core application development (instead of administering machines), but it does come at a small, unexpected price, which I will explain further below.

I wanted to confirm my hypothesis that an OS upgrade had occurred at around the same time, so I got in touch with Azure support to find out whether an OS upgrade had in fact been initiated by the Azure infrastructure on December 20th at around 15:20 PST. This is the reply I got:
Thank you for your patience. The behavior you perceived for the update is correct. One of the instances was brought online after an update and, once the role showed as "started", the update moved on to the machine in the other update domain to update that node.

The behavior is by design: we wait for the machine to report that the role has started before we start updating the other instance.

The ideal will be to try to lower the startup time for the application. [Unfortunately] this will happen every month for the machines, since we only rely on the role status to decide when to update the role instance in the other update domain.
Azure support also sent me a link to an article about managing multi-instance Windows Azure applications.

The web application I have running does take a bit of time to boot up, due to JIT compilation along with New Relic's .NET agent profiler hooking into IIS, but this usually takes several seconds, not minutes, to complete. What seems to be going on is that although the 2 web role instances are in different upgrade domains (upgrade domains 1 and 2 respectively), which means updates are rolled out to them on a non-overlapping schedule, in reality the updates can occur immediately one after the other, which makes sense from an infrastructure perspective. And because the Azure fabric relies on the status of the role instance itself and NOT your own application's status, it's entirely possible that when a web role instance that was just updated (in upgrade domain 1) appears as Ready to the Azure infrastructure, the web application running on that instance is still initializing. If it is still initializing when the second web role instance (in upgrade domain 2) is rebooted for the OS upgrade, there is no longer a live instance responding to web requests.
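To make that status-vs-readiness gap concrete, here is a minimal, hypothetical sketch of how a classic web role could report itself as Busy (via the RoleEnvironment.StatusCheck event) until the application has actually warmed up; as I understand it, the fabric won't treat a Busy instance as Ready, so the update walk shouldn't move on prematurely. The IsWarmedUp() readiness check is a placeholder you would implement yourself (e.g. by probing the local site until it responds):

    // A minimal sketch, assuming the classic Azure Cloud Services SDK
    // (Microsoft.WindowsAzure.ServiceRuntime). IsWarmedUp() is a hypothetical
    // readiness check, not a real SDK call.
    using System;
    using System.Threading;
    using Microsoft.WindowsAzure.ServiceRuntime;

    public class WebRole : RoleEntryPoint
    {
        // Flipped to true once the application has finished warming up.
        private static volatile bool applicationReady;

        public override bool OnStart()
        {
            // Answer the fabric's periodic status check with "Busy" until the
            // application is ready, so the instance isn't reported as Ready
            // while IIS/JIT/New Relic are still initialising.
            RoleEnvironment.StatusCheck += (sender, e) =>
            {
                if (!applicationReady)
                {
                    e.SetBusy();
                }
            };

            return base.OnStart();
        }

        public override void Run()
        {
            // Hypothetical warm-up loop: poll until the application responds,
            // then mark the instance as ready. base.Run() then blocks forever,
            // which is the normal behaviour for a web role.
            while (!IsWarmedUp())
            {
                Thread.Sleep(TimeSpan.FromSeconds(5));
            }
            applicationReady = true;

            base.Run();
        }

        private static bool IsWarmedUp()
        {
            // Placeholder: replace with a real check, e.g. an HTTP request to
            // the local site that succeeds once it has spun up.
            return true;
        }
    }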

There really were only 3 ways around this:

  1. Turn off automatic upgrades of the OS (but then who wants to manage OS updates manually, given that web roles are a PaaS service after all?).
  2. Figure out exactly what is causing the web role to come alive more slowly than expected (and why), and then spend the engineering resources to reduce that time substantially. Given that we'd never be able to drive it down to zero, there would always be a small but noticeable wait (probably seconds rather than minutes).
  3. Have 3 web role instances instead of 2. This way an OS upgrade rolled across the upgrade domains will only ever affect at most 2 of the instances at any given time (with a minor overlap, which in my case was 1-2 minutes). The additional instance does cost more money, but it's a really easy fix, and depending on the size of the role instances it might be the cheapest and easiest solution by far.
I chose option 3. In a startup, time is usually more precious than money, so if something can be fixed easily with money instead of time... that's generally a good route to take.
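For completeness, the instance count lives in the service configuration, alongside the guest OS settings that control automatic OS upgrades. The fragment below is a hypothetical ServiceConfiguration.cscfg (service and role names are placeholders); osVersion="*" is what opts you in to automatic guest OS updates, and Instances count="3" is the one-line change behind option 3:

    <!-- Hypothetical ServiceConfiguration.cscfg fragment; names are placeholders. -->
    <ServiceConfiguration serviceName="MyCloudService"
                          osFamily="3" osVersion="*"
                          xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration">
      <Role name="MyWebRole">
        <!-- Three instances, so a rolling OS upgrade (plus a slow warm-up on the
             instance that was just updated) still leaves one instance serving. -->
        <Instances count="3" />
      </Role>
    </ServiceConfiguration>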

------

Update: Since switching to more than 2 web role instances, I have never seen this downtime issue happen again.

2 comments:

Andrew said...

Good observation.
Apparently there is no way either to schedule backups for a web role or to let Azure know the web role is up in some custom way (e.g. via some API)?

I also invite you to try our monitoring service https://anturis.com . It does not yet have as sexy a UI as Pingdom (we are working on it), but it allows for more configuration flexibility and also includes server and app monitoring. We would appreciate your feedback.

Thanks

Andrew said...

Hi Andrew,

There are a few solutions out there for monitoring Azure web roles that I know of:
- Traffic Manager Monitoring (http://msdn.microsoft.com/en-us/library/windowsazure/dn339013.aspx), but this is currently in Preview only, and has been for quite some time, so I wouldn't recommend using it in Production
- Pingdom - a pretty simple service that hooks up to DNS, SMTP and HTTP(S) endpoints and provides uptime/downtime info along with email, push or SMS notifications.
- AreMySitesUp (http://aremysitesup.com/) - never used them but have heard good things.

I'll check out Anturis. What sort of notification system does it have for reporting downtime events or degraded request times?