Game Days (Week 11, Friday, March 28, 2014)
Game Day intro
From Weathering the Unexpected.
Annual, company-wide event over several days
- DiRT: Disaster Recovery Testing (or “Game Day”)
- Expensive to run
- Has caused actual outages leading to revenue loss
Given that every team is ready and on-site, problems can be resolved quickly
Game Day complements continuous component testing.
Example: A data centre is partitioned by the failure of a switch
- Recovery requires both DNS and directory services to fail over (switch to backups)
- The individual failovers should be tested regularly
- Game Day tests whether all the individual failovers work together
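As a rough illustration of that last point, here is a minimal Python sketch of a combined-failover check. All of the helpers (fail_over_dns, fail_over_directory, resolve, directory_lookup) are invented stubs, not real Google APIs; the point is only that the Game Day check triggers both failovers in the same run and then verifies something that depends on both.

```python
# Sketch only: the helpers below are hypothetical stand-ins for real failover hooks.

def fail_over_dns():
    """Pretend to switch name resolution to the backup DNS servers."""
    return True

def fail_over_directory():
    """Pretend to switch directory lookups to the replica."""
    return True

def resolve(hostname):
    """Pretend to resolve a name via whichever DNS is currently active."""
    return "10.0.0.1"

def directory_lookup(username):
    """Pretend to look up a user via whichever directory is active."""
    return {"user": username, "groups": ["eng"]}

def test_combined_failover():
    # Each failover may pass when tested in isolation; the Game Day
    # question is whether they still work together after one partition.
    assert fail_over_dns(), "DNS failover failed"
    assert fail_over_directory(), "directory failover failed"
    # End-to-end: a service that needs both must still function.
    assert resolve("directory.backup.example.com") is not None
    assert directory_lookup("alice")["groups"], "lookup empty after failover"

if __name__ == "__main__":
    test_combined_failover()
    print("combined failover OK")
```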
Example: Simulated earthquake
Caused failure of a Bay Area Google data centre
- Housed a number of internal services
- Some of those services could not recover (single-homed in that centre: a single point of failure)
- Other services failed over to their local workstations
- Internal authentication services failed
- Most Google staff now couldn’t work
- Everyone went to dinner
- Local cafes Denial-of-Service’d
- System for redirecting pages and alerts to non-Bay Area offices also failed
- Calls reporting errors were not received
- Accounting approvals-tracking did fail over
- But the people who could approve the new topology were themselves cut off
- So the automatically reconfigured system was useless
What to test
Failures of multiple systems in parallel
- Loss of a data centre
- Backbone fiber cable cut
- Failure of core infrastructure on which other systems depend (Colossus file system, authentication, …)
Identify weaknesses in less-tested interfaces between systems and groups
- Tests are designed by a team with members from all affected groups
- Multiple teams collaborate to resolve the problem
- Important to have them all there and ready
Be ready to turn off the test if serious problems arise
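A sketch of what a test runner for this might look like, in Python. The faults, their revert actions, and the abort signal are all invented for illustration (the source does not describe Google's actual tooling); it only shows the shape of the idea: inject several failures in parallel, watch for an abort from the command centre, and always revert.

```python
import threading
import time

# "Big red button": the command centre sets this if the test causes
# serious real-world problems and must be turned off early.
ABORT = threading.Event()

# Hypothetical faults to inject in parallel, and how to undo each one.
FAULTS = {
    "datacentre-loss": lambda: print("inject: take one data centre offline"),
    "fibre-cut":       lambda: print("inject: cut a backbone fibre link"),
    "auth-outage":     lambda: print("inject: core authentication unavailable"),
}
REVERTS = {name: (lambda n=name: print(f"revert: {n}")) for name in FAULTS}

def run_game_day(duration_s=5.0):
    # Inject every fault, not one at a time: the goal is to exercise the
    # less-tested interactions between systems and teams.
    for inject in FAULTS.values():
        inject()
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        if ABORT.is_set():
            print("serious problem reported -- turning off the test")
            break
        time.sleep(0.5)
    # Revert unconditionally, whether the test completed or was aborted.
    for revert in REVERTS.values():
        revert()

if __name__ == "__main__":
    run_game_day(duration_s=1.0)
```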
Command centre
Technical team
- Design cross-service tests
- Evaluate tests by individual teams
- Cause large-scale outages
- Monitor test progress
- Manage recovery when things go “worng”
Coordinators
- Plan and schedule tests
- Ensure preparation done
Example: Phone “bridge” emergency coordination
First test: Only one person figured out how to connect
Second test: 100 people connected … but bridge only held 40 connections
Third test: A caller put the bridge on hold … no one else could call in or out
Example: Buying diesel fuel for long-term emergency generators
Simulated long-term power outage at a data centre
Needed to buy extra diesel fuel for the backup generators ($$$)
No authorization possible from central management (comm failure)
Expected employees to use the (documented) emergency spending procedure
Instead: charged more than $100,000 on a personal credit card
Example: Zombies Attack!
One data centre goes down abruptly (Atlanta)
The zombies are fiction (… maybe?) but the shutdown is real
- Other three centres take up slack
- Performance SLA maintained
- But the resilience SLA requires being able to run on only two data centres
- So they cleanly shut down another (Europe)
- Now barely within performance SLA
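The capacity reasoning behind this example can be made concrete with a back-of-the-envelope calculation. The numbers below (total load, per-centre capacity, utilisation threshold) are invented, since the source gives no figures; under these assumptions three centres are comfortable, two are only just enough, and one is not.

```python
# All figures are made up for illustration.
TOTAL_LOAD = 70.0            # steady-state load, arbitrary units
PER_CENTRE_CAPACITY = 40.0   # capacity of each data centre
PERF_SLA_UTILISATION = 0.9   # stay under 90% utilisation to hold the latency SLA

def meets_performance_sla(centres_up):
    usable_capacity = centres_up * PER_CENTRE_CAPACITY * PERF_SLA_UTILISATION
    return TOTAL_LOAD <= usable_capacity

for up in (4, 3, 2, 1):
    status = "OK" if meets_performance_sla(up) else "SLA breach"
    print(f"{up} data centre(s) up: {status}")
# With these numbers: 4 up OK, 3 up OK (one lost), 2 up OK but only just, 1 up breach.
```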
Growing the organization
Start off with a few volunteer teams
First tests can be quite safe
Don’t expect a team’s first tests to really test much
Increase complexity and breadth of tests with each run
Embrace failure as a means of learning
Track every problem
A solid champion among senior executives aids the process (at Google, the VP of Operations)
Guide to reading for next class
Read Log Everything All the Time.
Key points:
- Why logging everything is necessary
- Why you need to build logging in rather than bolt it on later
- Why it needs to be fast
The post ends with a long list of bullet points for logging design. Each
entry is terse and there are many. Get a general gist of these guidelines
but don’t expect to memorize every one.
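As a preview of the second and third key points, here is a minimal Python sketch of logging that is built into the code path from the start but kept fast on that path: the hot path only enqueues records, and a background listener does the slow formatting and disk I/O. The service and function names are invented; the post itself is not about Python.

```python
import logging
import logging.handlers
import queue

# The hot path only enqueues log records (cheap); the listener thread
# below does the slow work of formatting and writing them to disk.
log_queue = queue.Queue(-1)
root = logging.getLogger()
root.setLevel(logging.INFO)
root.addHandler(logging.handlers.QueueHandler(log_queue))

file_handler = logging.FileHandler("service.log")
listener = logging.handlers.QueueListener(log_queue, file_handler)
listener.start()

log = logging.getLogger("service")

def handle_request(user, action):
    # Logging is part of the request path from day one, not bolted on later.
    log.info("request user=%s action=%s", user, action)
    return "ok"

if __name__ == "__main__":
    handle_request("alice", "login")
    listener.stop()
```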