Every bug you fix creates two more – Azure fire edition

A recent Azure DevOps outage highlights a number of ways well-meaning engineers shoot themselves in the foot.

Background
Azure DevOps engineers sometimes need to take a point-in-time snapshot of a production database to investigate customer escalations or evaluate potential performance improvements. To ensure these snapshot databases get cleaned up, a background job runs every day and deletes them once they pass a certain age.

What triggered the outage?
During Sprint 222, we upgraded our code base to replace the deprecated Microsoft.Azure.Management.* packages with the supported Azure.ResourceManager.* NuGet packages. This resulted in a large pull request of mechanical changes swapping out API calls. Hidden within this pull request was a typo bug in the snapshot deletion job which swapped out a call to delete the Azure SQL Database to one that deletes the Azure SQL Server that hosts the database. The conditions under which this code is run are rare and were not well covered by our tests.

We deployed Sprint 222 using our Safe Deployment Practices (SDP) to Ring 0 (our internal Azure DevOps organization) where no snapshot databases existed and so the job did not execute. After multiple days in Ring 0, we next deployed to Ring 1 where the impacted South Brazil scale unit was located. A snapshot database was present old enough to trigger the bug, and when the job deleted the Azure SQL Server, it also deleted all seventeen production databases for the scale unit. Starting at that moment, the scale unit was unable to process any customer traffic.

So much to unpack here. Let’s take it from the top.

“we upgraded our code base to replace the deprecated Microsoft.Azure.Management.* packages with the supported Azure.ResourceManager.* NuGet packages.”

Rule 18: “Naming stuff is hard” should be amended to include “Renaming stuff is fraught with peril”. There is something deeply ingrained in the developer psyche that compels us to stress out over the name of a thing. Up to a point this is beneficial: inconsistently applied and poorly thought out naming conventions cause all kinds of confusion and delay, and keeping names tight and internally consistent goes a long way towards making wrong code look wrong. But you run into diminishing returns pretty quickly, especially when you start renaming things that have been in production for a while, like Microsoft is prone to do for, um, reasons. This bit my team recently when we tried to change from Microsoft.SqlServer.Smo to Microsoft.SqlServer.Management.Smo to be on the latest version. This act of well-intentioned zen garden raking from Microsoft forced all kinds of unpleasant and error-prone busywork on everyone downstream, as evidenced by…

This resulted in a large pull request of mechanical changes swapping out API calls. Hidden within this pull request was a typo bug in the snapshot deletion job which swapped out a call to delete the Azure SQL Database to one that deletes the Azure SQL Server that hosts the database.

Nobody really reviews large pull requests, not even Azure DevOps engineers, and any change spurred by a renamed MS NuGet package is going to be huge because it invariably pulls in other related or required libraries. As an added bonus, there will be breaking changes, because MS changed method names and signatures while they were in there, because of course they did. Anyone stuck reviewing it will skim a few files, say to themselves “well, if anything is really wrong, testing will catch it”, and hit approve.
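
To make the “typo bug” concrete: the incident code was .NET using the Azure.ResourceManager.* packages, but every ARM SDK exposes the same pair of operations, and at the call site they are nearly twins. Below is an illustrative sketch only, using the Python management SDK (azure-mgmt-sql) rather than the actual .NET code, with invented subscription and resource names. The point is how little separates “delete one snapshot database” from “delete the server and everything it hosts” inside a thousand-file mechanical diff.

# Illustrative only: a Python analogue of the .NET calls described above.
# Subscription, resource group, server, and database names are all invented.
from azure.identity import DefaultAzureCredential
from azure.mgmt.sql import SqlManagementClient

client = SqlManagementClient(DefaultAzureCredential(), "<subscription-id>")

# What the snapshot cleanup job is supposed to do: delete one expired database.
client.databases.begin_delete("snapshot-rg", "scaleunit-sql", "snapshot-2023-05-01").result()

# The one-token slip: delete the server, and every database it hosts goes with it.
client.servers.begin_delete("snapshot-rg", "scaleunit-sql").result()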

The conditions under which this code is run are rare and were not well covered by our tests.

…or covered at all, it seems. Oops. That’s a tough problem to solve, because a team’s curated testing environment will almost never match the scale and complexity of production. I’m the last one to advocate for 100% unit test coverage, but perhaps functions that delete databases should get extra scrutiny. Green means go, so…

We deployed Sprint 222 using our Safe Deployment Practices (SDP) to Ring 0 (our internal Azure DevOps organization) where no snapshot databases existed and so the job did not execute. After multiple days in Ring 0, we next deployed to Ring 1 where the impacted South Brazil scale unit was located. A snapshot database was present old enough to trigger the bug, and when the job deleted the Azure SQL Server, it also deleted all seventeen production databases for the scale unit.

At this point, an Azure Fire was well and truly lit. To their credit, they rolled out the change incrementally, so we didn’t get a worldwide outage. I don’t know how many “rings” are in their Safe Deployment Practices, but it could just as easily have happened that there weren’t any snapshots in Ring 1 old enough to trigger the deletion. How many people would have been affected had this not been found until Ring 2 or beyond? It took 10 hours to put the fire out just for Brazil (read the article to learn how many edge cases were involved in recovering). The team has since “created a new test for the snapshot deletion job which fully exercises the snapshot database delete scenario against real Azure resources”, so this exact failure is unlikely to happen again. I’m not sure what root cause was decided on (the article doesn’t say), but I’d like to humbly suggest it was Microsoft renaming stuff simply because they can.
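
Credit for the “against real Azure resources” part, but you don’t need a live subscription to catch this particular class of mistake. Here is a hypothetical sketch of a cheaper guardrail in Python with a fake client; delete_expired_snapshots() and everything it touches are invented names, not Azure DevOps code. The idea is simply to run the deletion path and assert it only ever deletes databases, never servers.

# Hypothetical sketch: a mock-level guardrail for a snapshot deletion job.
# The function, client shape, and resource names are invented for illustration.
from types import SimpleNamespace
from unittest.mock import MagicMock

def delete_expired_snapshots(client, snapshots):
    # The job under test: remove each expired snapshot database, nothing else.
    for snap in snapshots:
        client.databases.begin_delete(snap.resource_group, snap.server, snap.name)

def test_cleanup_deletes_databases_never_servers():
    client = MagicMock()
    snap = SimpleNamespace(resource_group="snapshot-rg", server="scaleunit-sql", name="old-snap")
    delete_expired_snapshots(client, [snap])
    client.databases.begin_delete.assert_called_once_with("snapshot-rg", "scaleunit-sql", "old-snap")
    client.servers.begin_delete.assert_not_called()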

Rule 6 – shell expansion injection attack edition

Say it with me, everyone: cleverness is the mother of regret. Rachelbythebay with yet another tale of turn-by-turn navigation down the road to ruin in Fix the unit test and open a giant hole everywhere.

Like most dumpster fires, it starts with a mundane business problem:

Our program was going down a road where it might need to create a series of paths. This is where you have “/my/path” and you need a couple of directories nested underneath it, so you end up with “/my/path/project/staging” and “/my/path/project/prod” and things like that.

…that someone solves in a clever way:

What I found was… disturbing. I found a function that claimed to create directories, but it wasn’t just a simple passthrough to the usual C library mkdir() function…it was kicking off a subprocess to run the mkdir program

#include <cstdlib>
#include <string>
void create_a_directory(const std::string& path) {
  // the whole command line goes through a shell; path is never escaped
  std::system(("mkdir -p " + path).c_str());
}

Unfortunately, in doing this, it created an enormous security hole, too. The C system() call runs subcommands *through a shell*, and whatever you pass to the shell is subject to all kinds of fun shell expansion rules…These shell expansion rules include a semicolon to split commands. This means if you can get a semicolon and another command in there, it will happily run it right after the “mkdir -p /whatever”…It’s trivial to change it to do something far nastier at that point, like opening a shell to somewhere else, exfiltrating data, or whatever. Your program is now wide open to this kind of attack and you’ve changed nothing.
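
Her example is C, but the hole is language-agnostic: anything that hands a concatenated string to a shell has it. A throwaway Python rendition, with invented paths and a deliberately harmless payload, to make the semicolon trick concrete:

import subprocess

def create_a_directory(path):
    # Same shape as the C version above: the whole string goes through a shell.
    subprocess.run("mkdir -p " + path, shell=True)

# Looks fine with well-behaved input...
create_a_directory("/tmp/demo/project/staging")

# ...but the shell splits on the semicolon and runs whatever follows it.
# A real attacker would not settle for echo.
create_a_directory("/tmp/demo/x; echo pwned")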

And now for the best part: why it happened at all.

So finally, you’re probably wondering why this happened, and why someone would change a function that called a C library function into something that ran a $(*^&$( subprocess. Ah, well, that’s easy. Someone had a unit test that called that directory creation function with a complicated path that had several elements. Sometimes, not all of the intermediate directories existed, and then it would fail. They knew that “mkdir -p” is what they’d do by hand, but they needed the program to do it for them. They changed the common library, checked it in, reran their unit test, and now it started passing, so they were done.

To paraphrase Homer Simpson: To unit tests! The cause of, and solution to, all of dev’s problems.
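
The extra-frustrating part is that “mkdir -p” semantics were already sitting in the standard library, no subprocess required. In C it’s a short loop over mkdir(2); in Python (continuing the illustrative sketch above, not the original post’s code) it’s a one-liner:

from pathlib import Path

def create_a_directory(path):
    # "mkdir -p" without a shell: create intermediate directories as needed
    # and don't fail if the path already exists. Nothing here gets expanded.
    Path(path).mkdir(parents=True, exist_ok=True)

create_a_directory("/tmp/demo/project/staging")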

Cleverness is the Mother of Regret

Scott Locklin on the wisdom of Rules 6 and 17

One of the valuable things about popular but boring languages is that the code has been traversed many times, and routine stuff you’re likely to use in production is probably well debugged… The other benefit to boring languages is people concentrate on the problem, rather than the interesting complexities of the language itself.

https://scottlocklin.wordpress.com/2021/01/08/woo-for-its-own-sake/