Building reliable systems (Week 11, Monday, March 24)
Reliability for large-scale systems
Experiences of James Hamilton, then at Windows Live (“On Designing and Deploying Internet-scale Services”)
Three main principles (Bill Hoffman):
- Expect failures
- Keep things simple
- Automate everything
Application design
Keep development, test, and operations staff close to each other
- DevOps approach: “You built it, you support it”
Failure recovery must be
- Simple
- Tested frequently
- Perhaps hard-fail systems to shut them down
Redundancy
- Availability of four to five 9s requires redundant systems
- True redundancy means you can hard-fail any system without work loss
Support and deploy only one version
Multi-tenant your services
- Share hardware cost across all services
- Give up some system efficiency for operations efficiency
Allow rare human intervention in an emergency
- Test the scripts with fire drills or Game Days
Admission control—block new work when system is overloaded
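One way to picture admission control is a counter of in-flight requests in front of the handler: once the counter passes a limit, new work is rejected rather than queued. A minimal sketch, assuming a hypothetical limit `MAX_IN_FLIGHT` (not a value from the lecture):

```python
import threading

MAX_IN_FLIGHT = 100  # hypothetical limit on concurrently admitted requests

_in_flight = 0
_lock = threading.Lock()

class Overloaded(Exception):
    """Raised when the service refuses new work."""

def handle_request(work):
    """Admit the request only if the system has capacity; otherwise reject it."""
    global _in_flight
    with _lock:
        if _in_flight >= MAX_IN_FLIGHT:
            raise Overloaded("admission control: system overloaded, try later")
        _in_flight += 1
    try:
        return work()          # do the real request processing
    finally:
        with _lock:
            _in_flight -= 1

print(handle_request(lambda: "ok"))
```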
Shard the data
- Arbitrarily adjustable (not just first letter of family name)
- Use a hash or dictionary to map primary keys to shards (sketch below)
- Not bound to any real-world object (not “all employees of organization X”)
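A minimal sketch of hash-based sharding with a dictionary for explicit overrides; the shard count and key format are hypothetical, and a real system would also handle migration when the shard count changes:

```python
import hashlib

NUM_SHARDS = 16  # hypothetical shard count, not tied to any real-world grouping

def shard_for(primary_key):
    """Map an arbitrary primary key to a shard index by hashing it."""
    digest = hashlib.sha1(primary_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# A dictionary can override the hash for keys that have been migrated elsewhere.
shard_overrides = {}

def lookup_shard(primary_key):
    return shard_overrides.get(primary_key, shard_for(primary_key))

print(lookup_shard("customer:12345"))  # a shard index in [0, NUM_SHARDS)
```

Note that changing NUM_SHARDS remaps most keys; consistent hashing or an explicit directory is the usual way to keep that adjustment cheap.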
Analyze throughput and latency
- With background tasks running (garbage collection, etc.)
- Predict and measure access patterns (especially to databases, but also network and other components)
- Every call to a subservice should have a timeout (sketch after this list)
- Bound the number of restarts
- Log restarts
- Dependent (lower-level) services should commit to the same overall SLAs as their callers
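The timeout and bounded-restart bullets lend themselves to a small sketch; the timeout value, retry limit, and wrapper names below are hypothetical placeholders, not part of the lecture:

```python
import logging
from concurrent.futures import ThreadPoolExecutor, TimeoutError

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("subservice")

CALL_TIMEOUT_S = 0.5  # every call to a subservice gets a timeout (hypothetical value)
MAX_RESTARTS = 3      # bound the number of restarts (hypothetical limit)

def call_with_timeout(fn, *args):
    """Stop waiting for a subservice call after CALL_TIMEOUT_S seconds."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fn, *args).result(timeout=CALL_TIMEOUT_S)
    finally:
        pool.shutdown(wait=False)  # don't block on a stuck call

def call_with_bounded_restarts(fn, *args):
    """Retry a timed-out call a bounded number of times, logging each restart."""
    for attempt in range(1, MAX_RESTARTS + 1):
        try:
            return call_with_timeout(fn, *args)
        except TimeoutError:
            log.warning("restart %d/%d after timeout", attempt, MAX_RESTARTS)
    raise RuntimeError("subservice unavailable after %d restarts" % MAX_RESTARTS)
```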
Automatic management
Make your secondary backups synchronous
- The write is considered complete when both copies are written (sketch below)
- Cost in longer latency and lower throughput
- You can instantly route requests to the backup when the primary fails
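A toy sketch of the synchronous-backup idea: a write returns only after both replicas have applied it, which is where the extra latency comes from, but it means reads can be routed to the backup the moment the primary fails. The in-memory `Store` class is a hypothetical stand-in for a real replica:

```python
class Store:
    """Stand-in for one replica (primary or secondary)."""
    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value   # pretend this is a durable write

primary, backup = Store(), Store()

def synchronous_write(key, value):
    """The write is considered complete only when both copies are written."""
    primary.put(key, value)
    backup.put(key, value)       # waiting for the second copy costs latency and throughput
    return True

def read(key, primary_failed=False):
    """Route the request to the backup instantly if the primary has failed."""
    store = backup if primary_failed else primary
    return store.data[key]

synchronous_write("order:42", {"status": "paid"})
print(read("order:42", primary_failed=True))   # the backup already has the data
```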
Build in recovery at higher levels, not lower
- The service should send redundant messages (retry at the application level; sketch after this list)
- Don’t conceal recovery lower down, at the network protocol layer
- TCP doesn’t guarantee reliability across sessions
- Fail fast
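One way to read the “redundant messages” and “fail fast” bullets together: the caller tags each request with an ID and resends it at the service level until it is acknowledged, the receiver deduplicates, and after a bounded number of attempts the caller fails fast instead of hanging. Everything named below (the transport, the retry limit) is a hypothetical illustration:

```python
import uuid

seen_ids = set()   # receiver-side dedup, so resent (redundant) messages are applied once

def receive(message_id, payload):
    """Apply a message at most once and acknowledge it."""
    if message_id not in seen_ids:
        seen_ids.add(message_id)
        # ... apply payload here ...
    return "ack"

def send_reliably(payload, transport, max_attempts=3):
    """Service-level retry: don't rely on TCP to carry reliability across sessions."""
    message_id = str(uuid.uuid4())
    for _ in range(max_attempts):
        try:
            if transport(message_id, payload) == "ack":
                return True
        except ConnectionError:
            pass                  # resend the same message with the same ID
    raise RuntimeError("fail fast: message not delivered")   # surface the failure quickly

send_reliably({"event": "signup"}, transport=receive)
```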
Fail services regularly—see next day’s class
Operations
Never delete anything
- Just mark it deleted (sketch below)
- Make everything configurable (“Feature flags”)
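Both bullets are easy to show concretely; the record layout and flag names below are hypothetical:

```python
import time

records = {"user:1": {"name": "Ada", "deleted_at": None}}

feature_flags = {
    "new_checkout_flow": False,   # configurable at run time, no deploy needed
    "allow_hard_delete": False,
}

def delete(record_id):
    """Never delete anything: just mark the record as deleted."""
    records[record_id]["deleted_at"] = time.time()

def is_visible(record_id):
    return records[record_id]["deleted_at"] is None

delete("user:1")
print(is_visible("user:1"))   # False, but the data is still there for recovery or audit
```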
The “Big Red Switch”
- Turns off all non-essential services
- Use it when the system is melting down (or about to)
- Resetting the switch should restart stopped services automatically (sketch below)
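A sketch of a Big Red Switch as a single flag that sheds all non-essential work and brings it back automatically when reset; the service names are hypothetical:

```python
ESSENTIAL = {"auth", "checkout"}
NON_ESSENTIAL = {"recommendations", "email_digest", "analytics"}

running = ESSENTIAL | NON_ESSENTIAL

def set_big_red_switch(on):
    """Throwing the switch stops non-essential services; resetting it restarts them."""
    global running
    running = set(ESSENTIAL) if on else ESSENTIAL | NON_ESSENTIAL

set_big_red_switch(True)    # the system is melting down: shed non-essential load
print(sorted(running))      # only essential services keep running
set_big_red_switch(False)   # reset: stopped services restart without manual steps
print(sorted(running))
```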
Control admission into the service
Meter admission into the service when coming up after a failure
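Metering admission after a failure can be pictured as a ramp: admit only a small fraction of requests right after recovery and grow that fraction toward 100% over time. The ramp duration and the probabilistic admission below are hypothetical choices:

```python
import random
import time

RAMP_SECONDS = 300.0   # hypothetical: reach full admission five minutes after recovery
recovery_time = None

def mark_recovered():
    """Record the moment the service came back up after a failure."""
    global recovery_time
    recovery_time = time.monotonic()

def admitted():
    """Admit a growing fraction of requests as the ramp progresses."""
    if recovery_time is None:
        return True                        # normal operation: admit everything
    fraction = min(1.0, (time.monotonic() - recovery_time) / RAMP_SECONDS)
    return random.random() < fraction

mark_recovered()
print(admitted())   # almost always False right after recovery; True once the ramp ends
```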
Guide to reading for next class
Read all of
The Antifragile Organization.
This article describes Netflix’s approach to automatically generating
failures of system components. They have created a powerful culture of
stress-testing their systems and learning from any failures.