Game Days (Week 11, Friday, March 28, 2014)

Game day intro

From Weathering the Unexpected.

Annual, company-wide event held over several days

Given that every team is ready and on-site, problems can be resolved quickly

Game Day complements continuous component testing.

Example: A data centre is partitioned by failure of a switch

Example: Simulated earthquake

Caused failure of Google’s Bay Area data centre

What to test

Failures of multiple systems in parallel

Identify weaknesses in less-tested interfaces between systems and groups

Be ready to turn off the test if serious problems arise
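
A minimal sketch of the points above, assuming each failure can be expressed as a pair of inject/restore actions. The function name `run_game_day`, the `is_serious` check, and the toy `state` dictionary are all hypothetical placeholders, not anything from the article. The sketch injects several failures together, checks after each one whether the test should be called off, and always restores whatever it broke.

```python
def run_game_day(failures, is_serious):
    """Inject failures one after another, leaving earlier ones in place.

    failures: list of (name, inject, restore) with zero-argument callables
              (hypothetical stand-ins for real failure-injection tooling).
    is_serious: zero-argument callable; True means turn the test off now.
    Returns (completed, log).
    """
    log = []
    injected = []
    completed = True
    try:
        for name, inject, restore in failures:
            inject()
            injected.append((name, restore))
            log.append(f"injected {name}")
            if is_serious():
                log.append(f"abort after {name}")
                completed = False
                break
    finally:
        # Undo every injected failure in reverse order, even on abort or error.
        for name, restore in reversed(injected):
            restore()
            log.append(f"restored {name}")
    return completed, log


# Toy usage: two simulated failures; losing power counts as "serious".
state = {"switch": "up", "power": "up"}
failures = [
    ("switch", lambda: state.update(switch="down"), lambda: state.update(switch="up")),
    ("power", lambda: state.update(power="down"), lambda: state.update(power="up")),
]
completed, log = run_game_day(failures, is_serious=lambda: state["power"] == "down")
```

Putting the restore step in `finally` means the abort path and the normal path share the same cleanup, so turning off the test never leaves a failure in place.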

Command centre

Technical team

Coordinators

Example: Phone “bridge” emergency coordination

First test: Only one person figured out how to connect

Second test: 100 people connected … but bridge only held 40 connections

Third test: A caller put the bridge on hold … no one else could call in or out

Example: Buying diesel fuel for long-term emergency generators

Simulated long-term power outage for a data centre

Needed to buy extra diesel fuel for backup generator ($$$)

No authorization possible from central management (communications failure)

Expected employees to use (documented) emergency spending procedure

Instead: Charged > $100,000 on personal credit card

Example: Zombies Attack!

One data centre goes down abruptly (Atlanta)

The zombies are fiction (… maybe?) but the shutdown is real

  1. Other three centres take up slack
  2. Performance SLA maintained
  3. But resilience SLA requires running with only two data centres
  4. So they cleanly shut down another (Europe)
  5. Now barely within performance SLA
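
The arithmetic behind these steps can be sketched as follows. All numbers here are hypothetical: the notes give no capacities, loads, or SLA thresholds, so the sketch assumes four equal-sized centres, a fixed total load, and a performance SLA expressed as a maximum utilization. The names `site_c` and `site_d` stand in for the two centres the notes do not name.

```python
def utilization(total_load, capacities, down=()):
    """Fraction of surviving capacity needed to carry the total load."""
    surviving = [cap for site, cap in capacities.items() if site not in down]
    if not surviving:
        raise ValueError("no data centres left to serve the load")
    return total_load / sum(surviving)


# Hypothetical figures, chosen only to illustrate the shape of the scenario.
capacities = {"atlanta": 100, "europe": 100, "site_c": 100, "site_d": 100}
total_load = 180   # assumed steady load, in the same units as capacity
sla_limit = 0.95   # assumed performance SLA: stay at or below 95% utilization

all_up = utilization(total_load, capacities)
one_down = utilization(total_load, capacities, down={"atlanta"})
two_down = utilization(total_load, capacities, down={"atlanta", "europe"})
within_sla = two_down <= sla_limit
```

Under these assumed numbers, utilization climbs with each lost centre, and the two-centre case sits just under the limit. That is the "barely within performance SLA" situation in step 5.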

Growing the organization

Start off with a few volunteer teams

First tests can be quite safe

Don’t expect a team’s first tests to really test much

Increase complexity and breadth of tests with each run

Embrace failure as a means of learning

Track every problem

A strong champion among senior executives helps the process (at Google, the VP of Operations)

Guide to reading for next class

Read Log Everything All the Time.

Key points:

  1. Why logging everything is necessary
  2. Why you need to build logging in rather than bolt it on later
  3. Why it needs to be fast

The post ends with a long list of bullet points on logging design. Each entry is terse and there are many. Get the general gist of these guidelines, but don’t expect to memorize every one.