Every bug you fix creates two more – Azure fire edition

A recent Azure DevOps outage highlights a number of ways well-meaning engineers shoot themselves in the foot.

Background
Azure DevOps engineers sometimes need to take a point-in-time snapshot of a production database to investigate customer escalations or evaluate potential performance improvements. To ensure these snapshot databases get cleaned up, a background job runs every day and deletes them once they pass a certain age.
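(An aside: stripped of Azure specifics, a job like that is just a daily loop with an age check. The sketch below is invented for illustration; nothing in it is Azure DevOps' actual code, and the retention window is made up.)

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Invented sketch of an age-based snapshot cleanup job; not Azure DevOps' actual code.
class SnapshotCleanupJob
{
    // Hypothetical retention window; the real job's threshold isn't stated in the report.
    static readonly TimeSpan MaxSnapshotAge = TimeSpan.FromDays(14);

    public static async Task RunAsync(ISnapshotStore store)
    {
        foreach (SnapshotInfo snapshot in await store.ListSnapshotDatabasesAsync())
        {
            if (DateTimeOffset.UtcNow - snapshot.CreatedOn > MaxSnapshotAge)
            {
                // The one call that matters: delete the snapshot *database*, nothing else.
                await store.DeleteSnapshotDatabaseAsync(snapshot);
            }
        }
    }
}

// Minimal supporting types so the sketch stands on its own.
record SnapshotInfo(string ServerName, string DatabaseName, DateTimeOffset CreatedOn);

interface ISnapshotStore
{
    Task<IReadOnlyList<SnapshotInfo>> ListSnapshotDatabasesAsync();
    Task DeleteSnapshotDatabaseAsync(SnapshotInfo snapshot);
}
```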

What triggered the outage?
During Sprint 222, we upgraded our code base to replace the deprecated Microsoft.Azure.Management.* packages with the supported Azure.ResourceManager.* NuGet packages. This resulted in a large pull request of mechanical changes swapping out API calls. Hidden within this pull request was a typo bug in the snapshot deletion job which swapped out a call to delete the Azure SQL Database to one that deletes the Azure SQL Server that hosts the database. The conditions under which this code is run are rare and were not well covered by our tests.

We deployed Sprint 222 using our Safe Deployment Practices (SDP) to Ring 0 (our internal Azure DevOps organization) where no snapshot databases existed and so the job did not execute. After multiple days in Ring 0, we next deployed to Ring 1 where the impacted South Brazil scale unit was located. A snapshot database old enough to trigger the bug was present, and when the job deleted the Azure SQL Server, it also deleted all seventeen production databases for the scale unit. Starting at that moment, the scale unit was unable to process any customer traffic.

So much to unpack here. Let’s take it from the top.

“we upgraded our code base to replace the deprecated Microsoft.Azure.Management.* packages with the supported Azure.ResourceManager.* NuGet packages.”

Rule 18: “Naming stuff is hard” should be amended to include “Renaming stuff is fraught with peril”. There is something deeply ingrained in the developer psyche that compels developers to stress out over the name of a thing. Up to a point this is beneficial: inconsistently applied and poorly thought-out naming conventions cause all kinds of confusion and delay, and keeping names tight and internally consistent goes a long way towards making wrong code look wrong. But you run into diminishing returns pretty quickly, especially when you start renaming things that have been in production for a while, as Microsoft is prone to do for, um, reasons. This bit my team recently when we tried to change from Microsoft.SqlServer.Smo to Microsoft.SqlServer.Management.Smo to get on the latest version. This act of well-intentioned zen garden raking by Microsoft forced all kinds of unpleasant and error-prone busywork on everyone downstream, as evidenced by…

This resulted in a large pull request of mechanical changes swapping out API calls. Hidden within this pull request was a typo bug in the snapshot deletion job which swapped out a call to delete the Azure SQL Database to one that deletes the Azure SQL Server that hosts the database.

Nobody really reviews large pull requests, not even Azure DevOps engineers, and any change spurred by a renamed MS NuGet package is going to be huge, because it invariably pulls in other related or required libraries. As an added bonus, there will be breaking changes, because MS changed method names and signatures while they were in there, because of course they did. Anyone stuck reviewing it will skim a few files, say to themselves “well, if anything is really wrong, testing will catch it”, and hit approve.
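I have no idea what the actual Azure DevOps code looks like, but once you're in the new SDK it's easy to see how the slip happens: "delete this database" and "delete the server it lives on" are near-identical one-liners. Here's a sketch using the real Azure.ResourceManager.Sql types; every ID, name, and method wrapper below is invented.

```csharp
using System.Threading.Tasks;
using Azure;                     // WaitUntil
using Azure.Identity;            // DefaultAzureCredential
using Azure.ResourceManager;     // ArmClient
using Azure.ResourceManager.Sql; // SqlDatabaseResource, SqlServerResource

class SnapshotDeleteSketch
{
    // Placeholder identifiers, not the real scale unit's resources.
    const string SubscriptionId = "00000000-0000-0000-0000-000000000000";
    const string ResourceGroup  = "example-rg";
    const string ServerName     = "example-sql-server";
    const string SnapshotDbName = "example-snapshot-db";

    // What the job is supposed to do: delete one snapshot database.
    public static async Task DeleteSnapshotDatabaseAsync()
    {
        var arm = new ArmClient(new DefaultAzureCredential());
        SqlDatabaseResource snapshotDb = arm.GetSqlDatabaseResource(
            SqlDatabaseResource.CreateResourceIdentifier(
                SubscriptionId, ResourceGroup, ServerName, SnapshotDbName));
        await snapshotDb.DeleteAsync(WaitUntil.Completed);
    }

    // What the typo turned it into: nearly the same shape, wildly different
    // blast radius, because this deletes the server and every database it hosts.
    public static async Task DeleteWholeServerAsync()
    {
        var arm = new ArmClient(new DefaultAzureCredential());
        SqlServerResource server = arm.GetSqlServerResource(
            SqlServerResource.CreateResourceIdentifier(
                SubscriptionId, ResourceGroup, ServerName));
        await server.DeleteAsync(WaitUntil.Completed);
    }
}
```

In a large mechanical PR, that difference is a handful of characters on one line. Good luck catching it on a skim.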

The conditions under which this code is run are rare and were not well covered by our tests.

…or covered at all, it seems. Oops. That's a tough problem to solve, because a team's curated testing environment will almost never match the scale and complexity of production. I'm the last one to advocate for 100% unit test coverage, but perhaps functions that delete databases should get extra scrutiny.
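For what it's worth, here's the kind of cheap guardrail I have in mind. This is my suggestion, not anything the report describes; the only real API in it is ResourceIdentifier from Azure.Core, and the guard itself and its name are made up.

```csharp
using System;
using Azure.Core; // ResourceIdentifier, ResourceType

static class DeleteGuards
{
    // Suggested guardrail, not the team's actual fix: before issuing a delete,
    // verify the target ARM resource really is a SQL database and not its parent server.
    public static void EnsureIsSqlDatabase(ResourceIdentifier id)
    {
        if (!string.Equals(id.ResourceType.ToString(),
                           "Microsoft.Sql/servers/databases",
                           StringComparison.OrdinalIgnoreCase))
        {
            throw new InvalidOperationException(
                $"Refusing to delete '{id}': expected a SQL database, got '{id.ResourceType}'.");
        }
    }
}
```

One unit test that feeds that guard a server ID instead of a database ID costs minutes to write. Green means go, so…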

We deployed Sprint 222 using our Safe Deployment Practices (SDP) to Ring 0 (our internal Azure DevOps organization) where no snapshot databases existed and so the job did not execute. After multiple days in Ring 0, we next deployed to Ring 1 where the impacted South Brazil scale unit was located. A snapshot database old enough to trigger the bug was present, and when the job deleted the Azure SQL Server, it also deleted all seventeen production databases for the scale unit.

At this point, an Azure Fire was well and truly lit. To their credit, they rolled out the change incrementally, so we didn't get a worldwide outage. I don't know how many “rings” are in their Safe Deployment Practices, but it could just as easily have happened that there weren't any snapshots in Ring 1 old enough to trigger the deletion. How many people would have been affected had this not been found until Ring 2 or beyond? It took 10 hours to put the fire out just for Brazil (read the article to learn how many edge cases were involved in recovering). The team has since “created a new test for the snapshot deletion job which fully exercises the snapshot database delete scenario against real Azure resources”, so this exact failure is unlikely to happen again. I'm not sure what root cause was decided on (the article doesn't say), but I'd like to humbly suggest it was Microsoft renaming stuff simply because they can.
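I can only guess what that new test looks like, but a test against real Azure resources would need roughly this shape. xUnit-flavored, every identifier below is invented, and RunSnapshotDeletionJobAsync is a stand-in for whatever the real job's entry point is.

```csharp
using System.Threading.Tasks;
using Azure;                     // Response<T>
using Azure.Identity;            // DefaultAzureCredential
using Azure.ResourceManager;     // ArmClient
using Azure.ResourceManager.Sql; // SqlServerResource
using Xunit;

public class SnapshotDeletionJobTests
{
    // Placeholder identifiers for a throwaway test server, not real resources.
    const string SubscriptionId = "00000000-0000-0000-0000-000000000000";
    const string ResourceGroup  = "snapshot-test-rg";
    const string ServerName     = "snapshot-test-server";
    const string SnapshotDbName = "snapshot-test-db";

    [Fact]
    public async Task Job_DeletesSnapshotDatabase_ButLeavesTheServerAlone()
    {
        var arm = new ArmClient(new DefaultAzureCredential());
        SqlServerResource server = arm.GetSqlServerResource(
            SqlServerResource.CreateResourceIdentifier(
                SubscriptionId, ResourceGroup, ServerName));

        // Run the job under test against the real (throwaway) server.
        await RunSnapshotDeletionJobAsync();

        // The aged snapshot database should be gone...
        Response<bool> snapshotExists =
            await server.GetSqlDatabases().ExistsAsync(SnapshotDbName);
        Assert.False(snapshotExists.Value);

        // ...and the server hosting the production databases should still be there.
        Response<SqlServerResource> stillThere = await server.GetAsync();
        Assert.NotNull(stillThere.Value);
    }

    // Stand-in for the real deletion job's entry point.
    static Task RunSnapshotDeletionJobAsync() => Task.CompletedTask;
}
```

Not a cheap test to run, but a lot cheaper than a ten-hour recovery of seventeen production databases.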
