transmissions from a free roaming agent of kaos: Conclusions from Betfair's Outage

Niall Wass and Tony McAlister of betfair recently published a summary of betfair's 6-hour outage on 12 March 2011. What follows is a review of their analysis.

Most of betfair's customers will have no idea what Niall and Tony are talking about. "This [policy] should give maximum stability throughout a busy week that includes the Cheltenham Festival, cricket World Cup and Champions League football" is about the only non-tech part of the article that their customers can relate to. However, for us technologists, the post provides some tasty detail that lets us learn from others' mistakes.

The post is consistent with a growing and positive trend of tech-oriented companies disclosing at least some technical detail of what caused a failure and what will be done about it in the future. Some benefits of this approach:

1. Apologize to your customers when you mess them about - always a good thing to do (easyJet or Ryanair - I hope you're reading this). Even better is to offer your customers a treat - unfortunately betfair only alluded to one and didn't provide a concrete commitment.

2. Give public sector analysts some confidence that this publicly traded company isn't about to capsize with technical problems.

3. Receive broad review and possibly feedback about the failure. Give specialist suppliers a chance to pitch to help out in potentially new and creative ways.

4. Drive internal strategy and funding processes in a direction they otherwise might not be moving.

Level of change tends to be inversely proportional to stability. "In a normal week we make at least 15 changes to the Betfair website…". This is a powerful lesson that many non-tech people do not understand - the more change you shove into a system, the more you tend to decrease its stability. This statement also hints that betfair has not adopted the more progressive devops and continuous delivery practices for pushing change into production more safely.

The change control thinking continues with "… but we have resolved not to release any new products or features for the next seven days". This is absolutely the right thing to do when you're having stability issues. Shut down the change pipeline immediately to anything other than highly targeted stability improvements. Present the feature freeze to the customer as a "benefit" (improved stability) and send a firm message to noisy internal product managers to take a deep breath and come back next week to push their agenda.

Although betfair might not be up on their devops and continuous delivery, they have followed the recent Internet services trend of being able to selectively shut down aspects of their service to preserve other aspects:

- "we determined that we needed our website 'available' but with betting disallowed"

- "in an attempt to quickly shed load, we triggered a process to disable some of the computationally intensive features on the site"

- "several operational protections in place to limit these types of changes during peak load"

Selective service shutdown is a positive sign; it hints that:

1. The architecture is at least somewhat component based and loosely coupled.

2. There is a strategy to prioritize and switch off services under system duress.
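To make the idea of prioritized shutdown concrete, here is a minimal sketch of a feature switchboard that sheds the least essential features first. The feature names, priorities and load figures are all hypothetical; nothing here reflects how betfair actually implements its switches.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    name: str
    priority: int  # lower number = more essential, shed last
    cost: int      # hypothetical load units this feature accounts for
    enabled: bool = True


@dataclass
class FeatureSwitchboard:
    features: list[Feature] = field(default_factory=list)

    def current_load(self) -> int:
        return sum(f.cost for f in self.features if f.enabled)

    def shed_until(self, target_load: int) -> list[str]:
        """Disable the least essential features first until load <= target."""
        disabled = []
        for f in sorted(self.features, key=lambda f: f.priority, reverse=True):
            if self.current_load() <= target_load:
                break
            if f.enabled:
                f.enabled = False
                disabled.append(f.name)
        return disabled

    def is_enabled(self, name: str) -> bool:
        return any(f.name == name and f.enabled for f in self.features)


# Hypothetical features and costs, for illustration only.
board = FeatureSwitchboard([
    Feature("static_pages", priority=0, cost=20),
    Feature("betting", priority=1, cost=40),
    Feature("live_graphs", priority=2, cost=30),
    Feature("recommendations", priority=3, cost=10),
])
shed = board.shed_until(65)
print(shed, board.current_load())
```

The point of the sketch is that the shutdown order is decided ahead of time, in calm conditions, rather than improvised during the outage.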

The assertion that betfair spent several hours verifying stability before opening the site to the public suggests bravery under fire. "We recovered the site internally around 18:00 and re-enabled betting as of 20:00 once we were certain it was stable". There must have been intense business pressure to resume earning money once it appeared the problem was solved. However, during a major event, you want to make sure you're back to a stable state before you reopen your services. A system can be in a delicate state when it is first opened back up to public load levels (e.g., page, code and data reload burden). That is one reason why we still like to perform system maintenance during low-use hours: the slam of customers through the doors when the website/service reopens is at least minimized.

The crux of the issue appears to be around content management, particularly web page publication. Publishing content is tricky as there are two conditions that should be thoughtfully considered:

- Content being served while it is changing, which results in "occasional broken pages caused by serving content" and here-and-gone content, where content has been pushed to one server but not another

- Inconsistency between related pieces of content (e.g., a promotional link on one page pointing to a new promotion page that hasn't been published yet)
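One common way to guard against both conditions is to build each new version of the content off to the side, validate it as a whole, and then switch to it in a single step. The sketch below is illustrative only: `ContentStore`, its methods and the link map are hypothetical names, it covers a single process, and it is not a description of betfair's CMS. Keeping a whole farm of servers on the same version needs additional coordination that is not shown.

```python
import threading
from types import MappingProxyType


class ContentStore:
    """Serves pages from an immutable snapshot that is swapped in one step."""

    def __init__(self, pages: dict[str, str]):
        self._lock = threading.Lock()  # serializes publishers only
        self._snapshot = MappingProxyType(dict(pages))
        self.version = 1

    def get(self, path: str) -> str:
        # Readers use whatever snapshot is current; it never mutates under them.
        return self._snapshot[path]

    def publish(self, changes: dict[str, str], links: dict[str, list[str]]) -> None:
        """Build the next snapshot off to the side, validate it, then swap."""
        with self._lock:
            candidate = dict(self._snapshot)
            candidate.update(changes)
            # Reject the whole batch if any page links to a missing page.
            for source, targets in links.items():
                for target in targets:
                    if target not in candidate:
                        raise ValueError(f"{source} links to unpublished {target}")
            self._snapshot = MappingProxyType(candidate)  # single reference swap
            self.version += 1


store = ContentStore({"/home": "old home"})

# A batch whose promotional link points at an unpublished page is rejected whole.
try:
    store.publish({"/home": "new home"}, links={"/home": ["/promo"]})
    rejected = False
except ValueError:
    rejected = True

# Publishing the page and its link target together succeeds.
store.publish({"/home": "new home", "/promo": "promo page"},
              links={"/home": ["/promo"]})
```

Because readers only ever see a complete snapshot, a half-published page or a dangling promotional link never reaches them.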

It appears that betfair's content management system (CMS) is neither async nor real time: "Every 15 minutes, an automated process was publishing…". Any system designed with hard time dependencies is a timebomb waiting to go off, with the trigger being increasing load: "Yesterday we hit a tipping point as the web servers reached a point where it was taking longer than 15 minutes to complete their update". The lack of thread-safe design is another indicator that the design is not async, since async designs tend to enforce thread safety: "servers weren't thread-safe on certain types of content changes". A batch rather than on-demand approach is also symptomatic of the same design problem: "Unfortunately, the way this was done triggered a complete recompile of every page on our site, for every user, in every locale". All of this points to a batch publish model rather than an async, on-demand pull model.
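The tipping point is easy to see in a small simulation. The sketch below assumes a job started on a fixed 15-minute schedule (the interval quoted in the post) and a made-up series of run durations that grow with load. It counts how many earlier runs are still in progress at each scheduled start, then contrasts that with a fixed-delay schedule in which each run starts only after the previous one has finished. The durations and function names are assumptions for illustration.

```python
INTERVAL_MIN = 15  # publish interval quoted in the post


def overlapping_runs(durations_min, interval_min=INTERVAL_MIN):
    """For each scheduled start, count earlier runs that are still in progress."""
    starts = [i * interval_min for i in range(len(durations_min))]
    ends = [s + d for s, d in zip(starts, durations_min)]
    return [sum(1 for j in range(i) if ends[j] > starts[i])
            for i in range(len(starts))]


def fixed_delay_starts(durations_min, gap_min):
    """Start each run a fixed gap after the previous one ends, so runs never overlap."""
    starts, clock = [], 0.0
    for d in durations_min:
        starts.append(clock)
        clock += d + gap_min
    return starts


# Hypothetical run times in minutes, growing as load increases.
durations = [10, 14, 16, 35, 40, 20]

overlaps = overlapping_runs(durations)
print(overlaps)  # runs begin to pile up once a run exceeds the interval

safe_starts = fixed_delay_starts(durations, gap_min=5)
print(safe_starts)
```

With a fixed-rate timer the overlap count climbs as soon as one run takes longer than the interval. The fixed-delay variant trades freshness for safety: content is published less often under load, but runs never stack up.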

The post concludes with a statement of what has been done to make sure the problem doesn't happen again:

1. "We've disabled the original automated job and rebuilt it to update content safely" - given the above design issues, while thread safety may have been addressed, until they address the fundamental synchronous design, I'd guess there will likely be other issues with it in the future.

2. "We've tripled the capacity of our web server farm to spread our load even more thinly" - hey, if you've got the money in the bank to do this, excellent. However, it probably points to an underlying lack of capacity planning capability. And of course, every one of those web servers depends on other services (app server, caches, databases, network, storage, …) - what have you done to those services by tripling demand on them? Lots of spare capacity is great to have, but can potentially hide engineering problems.

3. "We've fixed our process for disabling features so that we won't make things worse."

4. "We've updated our operational processes and introduced a whole new raft of monitoring to spot this type of issue." - tuning monitoring, alerting, and trending system(s) after an event like this is crucial.

5. "We've also isolated the underlying web server issue so that we can change our content at will without triggering the switch to single-threading"
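The question raised under item 2, about what tripling the web farm does to the services behind it, can be made concrete with some back-of-the-envelope arithmetic. The figures below (server counts, connection limit, pool size) are entirely hypothetical and are not taken from the betfair post. The sketch only shows the kind of check worth running before scaling one tier.

```python
from dataclasses import dataclass


@dataclass
class Downstream:
    name: str
    max_connections: int       # hypothetical hard limit on the shared service
    conns_per_web_server: int  # hypothetical pool size each web server opens


def headroom(service: Downstream, web_servers: int) -> int:
    """Connections left on the shared service; negative means over the limit."""
    return service.max_connections - web_servers * service.conns_per_web_server


db = Downstream("database", max_connections=2000, conns_per_web_server=20)

before = headroom(db, web_servers=40)   # 2000 - 40 * 20
after = headroom(db, web_servers=120)   # 2000 - 120 * 20
print(before, after)
```

In this made-up case the database had comfortable headroom before the change and is oversubscribed after it, even though no individual web server is doing anything new.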

And here are the lessons I was reminded of, or learned, from the post:

- If you're having a serious problem, stop all changes that don't have to do with fixing the problem

- Selective (de)activation of loosely coupled and component services is a vital feature and design approach

- Make sure the systems are stable and strong after an event before you open the public floodgates

- Synchronous and timer based design approaches are intrinsically dangerous, especially if you're growing quickly

- Capacity planning is important, best done regularly, incrementally and organically (like most things), not in huge bangs. One huge bang now can cause others in the future.

- Having lots of spare capacity allows you to avoid problems… for a while. Spare capacity doesn't fix architectural issues, just delays their appearance.

- Technology is hard and technology at scale is really hard!

Niall and Tony, thanks for giving us the opportunity to learn from what happened at betfair.

18 March 2011
