Friday, November 20, 2015


Security Warnings in API Docs are not Enough

May 25, 1979: American Airlines flight 191 started down the runway at Chicago O'Hare Airport. Just before takeoff, the left engine tore itself completely off of the wing, severing four critical hydraulic lines and disabling several safety systems. Twenty seconds after takeoff, the loss of hydraulic pressure caused the left wing's control surfaces to stop responding and the plane began to bank steeply to the left. Thirty-one seconds after takeoff, the plane was a fireball on the ground, killing 273 people. This remains the deadliest air accident in US history and is very well documented. While the airline industry has certainly learned a lot from this tragedy, I believe there are lessons that we, as software developers, can take from it as well.

What Happened?

Before we can draw any wisdom from this tragedy, we must understand the dramatic mechanical failure that caused the engine to free itself from the wing. The McDonnell Douglas DC-10 wing engines are attached to a large arm called the "pylon", which is then attached to the wing, as you can see here:


For various maintenance reasons, mechanics need to detach the engine and pylon from the wing. The procedure for doing this, as provided by McDonnell Douglas, calls for the removal of the engine first, followed by the removal of the pylon. However, this process is very time-consuming, especially if you don't have a specific reason to detach the engine from the pylon. That's why several carriers, including American Airlines, independently developed procedures for detaching the pylon from the wing while the engine was still attached. AA's procedure involved using a forklift to hold the engine and pylon assembly while the pylon/wing bolts were removed and re-installed. McDonnell Douglas did not approve this procedure, and may have cautioned against it, but they could not dictate to any airline what procedures were used.

As it turns out, it is very difficult to manipulate a heavy engine and pylon assembly using a forklift with the precision required to avoid damaging the aircraft. In the case of the flight 191 aircraft, the rear pylon attachment point had been pressed up against the wing too hard, which created a fracture in the pylon's rear bracket. Over the next couple of months, this fracture widened with each takeoff and landing. When it finally failed, the engine's thrust pulled the entire assembly forward, rotating it up and over the front edge of the wing. The engine/pylon took a chunk of the wing with it and cut the wing's hydraulic lines in the process. Inspection of other DC-10 planes after the crash revealed that similar damage had resulted from similar short-cut procedures used by both American and Continental Airlines.

Clearly, the majority of responsibility for the flight 191 accident lies with the airline maintenance staff, since they didn't follow the recommended procedure. The aircraft engineers at McDonnell Douglas may very well have anticipated the potential problems with trying to detach the pylon from the wing with the engine still attached, which is why they provided a safer procedure in the manual. But for McDonnell Douglas, this was little comfort when all DC-10s in the US were grounded for 37 days. This caused huge problems for the company in a competitive aircraft market. It was little comfort to the victims and those affected by the crash. Everyone loses in these situations, even those who are "right" about a seemingly arcane technical issue.

Lessons about People and Process

If software security is about People, Process and Technology, as espoused by Schneier, then these kinds of issues seem to fall squarely in the People and Process categories. Especially when technical pitfalls are documented, it is easy for engineers who are knowledgeable in a particular area to develop ivory tower syndrome and take the stance: "I told you not to do it that way, but if you want to shoot yourself in the foot, by all means..." But if our goal is to provide end-to-end safety or security, then this mentality isn't acceptable. As it turns out, there are things engineers can do, besides just documenting risks, to use Technology to head off People and Process problems. This isn't always possible: some problems simply cannot be addressed with Technology alone. But many can be mitigated, provided those problems are anticipated to begin with.

Typically in software, the downsides of failure are not nearly as serious. However, the kind of displaced fallout that McDonnell Douglas experienced also shows up in software security. One example would be open source blog software packages, such as WordPress. In a number of discussions I've had with clients and security folk, the topic of WordPress security has come up. Everything I hear indicates that WordPress has a pretty poor reputation in this area. In one way, this seems a little odd to me, since I have briefly looked at the core WordPress code base a few times and they do a lot of things right. Sure, WordPress has its share of security issues, don't get me wrong, but the core software isn't that terrible. However, if you do a CVE search for WordPress, the number of vulnerabilities associated with WordPress plugins is quite depressing. To me, it is apparent that bad plugin security has hurt WordPress' reputation around security in general, despite the majority of vulnerabilities lying somewhat outside the core developers' control.

There are two primary ways that engineers can help guide their technical customers (whether they are other programmers or maintenance crews) down a safe path: discourage dangerous usage, and make safe usage much easier than the alternatives.

Discouraging Dangerous Usage

Let us return to the issue of mechanics trying to remove the engine and pylon assembly all in one piece. If the McDonnell Douglas engineers had anticipated that this would be unsafe, then they could have made small changes to the engine/pylon assembly such that, when the engine is attached, some of the mounting bolts between the pylon and the wing are covered up. In this way, it becomes technically infeasible (short of getting out a hacksaw) to carry on with the procedure that the airlines devised.

In the case of WordPress, if the core developers realized that many plugin authors keep making mistakes using, say, an unsafe PHP function (there are soooo many to choose from...), then perhaps they could find a way to deploy a PHP configuration that disables the unsafe functions by default (using the disable_functions option or equivalent). Sure, developers could override this, but it would give many developers pause as to why they have to take that extra step (and then perhaps more of them would actually RTFM).
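To make that concrete, here is a minimal sketch of what such a hardened default might look like, assuming a host-level php.ini. The disable_functions and allow_url_include directives are real PHP settings; the specific function list is only an illustration, not a vetted recommendation:

    ; Hypothetical hardened php.ini fragment for a WordPress host.
    ; The functions listed are an illustrative sample of commonly
    ; abused ones, not an exhaustive or recommended set.
    disable_functions = exec,passthru,shell_exec,system,proc_open,popen

    ; Also refuse to include() code from remote URLs, another frequent
    ; source of plugin vulnerabilities.
    allow_url_include = Off

A plugin author who genuinely needs shell access would then have to ask the site operator to loosen this, and that little bit of friction is exactly what prompts people to go read the documentation first.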

Making Safe Usage Easier

Of course, disabling features or otherwise making life difficult for your customers is not the best way to make yourself popular. A better way to encourage safety by developers (or mechanics) would be to devise faster/better solutions to their problems that are also safe. In the case of the airline mechanics, once McDonnell Douglas realized that three airlines were using a short-cut procedure, they could have evaluated the risks and devised another procedure that was both fast and safe. For instance, if they had tested United's method of using a hoist (rather than a forklift), they might have realized that a hoist is perfectly fine and encouraged the other two airlines to use that method instead. Or perhaps they could have provided a special protective guide, harness, or special jacks that would allow for fine control over the engine/pylon assembly when manipulating it.

In the case of WordPress, instead of just disabling dangerous interfaces in PHP, they could also provide alternative interfaces that are much less likely to be misused. For example: database access APIs that don't require developers to write SQL statements by hand, or file access primitives that make directory traversal impossible within a certain sub-tree. Of course it depends on the kinds of mistakes that developers keep making, but by adding APIs that are both safe by default and that save developers time, more and more of the developer population will gravitate toward safe usage.
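WordPress actually gestures in this direction with its placeholder-based $wpdb->prepare() interface. As a minimal sketch of the kind of API shape I mean (the table name and helper function below are hypothetical, made up purely for illustration):

    <?php
    // Sketch of a plugin helper built on WordPress's placeholder-based
    // query API rather than hand-concatenated SQL. $wpdb->prepare(),
    // $wpdb->get_row() and absint() are real WordPress APIs; the table
    // "my_plugin_items" is invented for this example.
    function my_plugin_get_item( $item_id, $owner ) {
        global $wpdb;

        return $wpdb->get_row(
            $wpdb->prepare(
                "SELECT * FROM {$wpdb->prefix}my_plugin_items
                 WHERE item_id = %d AND owner = %s",
                absint( $item_id ),
                $owner
            )
        );
    }

The point isn't this particular API so much as its shape: the safe path (placeholders instead of string concatenation) is also the shortest path, so plugin authors drift toward it without having to think about SQL injection at all.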

Conclusion

Once again, it is easy to pass the buck on these kinds of problems and assume, as an API designer, that your users' poor choices are out of your control. It is also easy to assume that your users are just as technically savvy as yourself and won't make mistakes that seem obvious to you. But these are both bad assumptions and should be constantly questioned when it comes to ensuring the security of the overall system.