Building reliable systems (Week 11, Monday, March 24)

Reliability for large-scale systems

Experiences of James Hamilton, then at Windows Live (On Designing and Deploying Internet-Scale Services).

Three main principles (Bill Hoffman): expect failures, keep things simple, automate everything.

Application design

Keep development, test, and operations staff close to each other

Failure recovery must be

Redundancy

Support and deploy only one version

Multi-tenant your services

Allow rare human intervention in an emergency

Admission control—block new work when system is overloaded

Shard the data

Analyze throughput and latency
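As an illustration of the "Shard the data" item above, here is a minimal sketch of hash-based sharding in Python. The names `shard_for` and `ShardedStore` are hypothetical, and in-memory dictionaries stand in for real storage servers; the notes do not prescribe a particular partitioning scheme, so simple hash-modulo placement is an assumption made for brevity.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash.

    hashlib is used instead of the built-in hash() because the built-in
    string hash is randomized per process, which would move keys around
    between restarts.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Toy key-value store that spreads keys over several shards.

    Each shard is a plain dict here (an assumption for illustration);
    in a real service each would be a separate machine or database.
    """

    def __init__(self, num_shards: int):
        self.shards = [dict() for _ in range(num_shards)]

    def _shard(self, key: str) -> dict:
        return self.shards[shard_for(key, len(self.shards))]

    def put(self, key: str, value) -> None:
        self._shard(key)[key] = value

    def get(self, key: str, default=None):
        return self._shard(key).get(key, default)


store = ShardedStore(num_shards=4)
for user_id in ("alice", "bob", "carol", "dave"):
    store.put(user_id, {"name": user_id})
```

Hash-modulo placement is the simplest choice, but changing the number of shards relocates most keys; schemes such as consistent hashing reduce that movement.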

Automatic management

Make your secondary backups synchronous

Build in recovery at higher levels, not lower

Fail services regularly—see next day’s class
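To make "Fail services regularly" and "Build in recovery at higher levels" concrete, the following is a small sketch, not a description of any particular tool. The names `InjectedFailure`, `with_fault_injection`, and `call_with_retry` are hypothetical. It wraps a function so that a configurable fraction of calls fail on purpose, and puts the recovery logic (a retry) in the caller rather than in the failing component.

```python
import random


class InjectedFailure(Exception):
    """Raised when the wrapper deliberately fails a call."""


def with_fault_injection(func, failure_rate: float, rng: random.Random):
    """Return a version of func that fails with probability failure_rate."""

    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def call_with_retry(func, attempts: int):
    """Higher-level recovery: retry the call up to `attempts` times."""
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except InjectedFailure as error:
            last_error = error
    raise last_error


def fetch_profile():
    # Stand-in for a call to a real backend service.
    return {"user": "alice"}


# A seeded generator keeps test runs reproducible.
flaky_fetch = with_fault_injection(fetch_profile, failure_rate=0.3,
                                   rng=random.Random(42))
```

Running tests against `flaky_fetch` instead of `fetch_profile` exercises the recovery path routinely, rather than only during a real outage.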

Operations

Never delete anything: just mark it deleted

Make everything configurable (“Feature flags”)

The “Big Red Switch”

Control admission into the service

Meter admission into the service when coming up after a failure
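The last three items can be sketched together in one small Python class. This is an illustrative sketch under stated assumptions, not a production design: `AdmissionController` and its parameters are hypothetical names, the "Big Red Switch" is taken here to mean shedding non-critical work while critical work continues, and the linear ramp after startup is just one possible way to meter admission.

```python
import time


class AdmissionController:
    """Limits concurrent work, ramps up slowly after (re)start,
    and supports an emergency switch that sheds non-critical work."""

    def __init__(self, max_in_flight: int, ramp_seconds: float,
                 clock=time.monotonic):
        self.max_in_flight = max_in_flight
        self.ramp_seconds = ramp_seconds
        self.clock = clock              # injectable so tests can fake time
        self.started_at = clock()       # moment the service came up
        self.in_flight = 0
        self.big_red_switch = False     # flipped by an operator in an emergency

    def current_limit(self) -> int:
        """Metered admission: the limit grows linearly during the ramp."""
        if self.ramp_seconds <= 0:
            return self.max_in_flight
        elapsed = self.clock() - self.started_at
        fraction = min(1.0, elapsed / self.ramp_seconds)
        # Always allow at least one request so the service can make progress.
        return max(1, int(self.max_in_flight * fraction))

    def try_admit(self, critical: bool = False) -> bool:
        """Return True if the request may proceed, False if it is rejected."""
        if self.big_red_switch and not critical:
            return False                # shed non-critical work
        if self.in_flight >= self.current_limit():
            return False                # overloaded: block new work
        self.in_flight += 1
        return True

    def release(self) -> None:
        """Call when an admitted request finishes."""
        if self.in_flight > 0:
            self.in_flight -= 1
```

A request handler would call `try_admit()` before doing any work, return an "overloaded, retry later" response when it gets `False`, and call `release()` in a `finally` block once the work completes.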

Guide to reading for next class

Read all of The Antifragile Organization.

This article describes Netflix’s approach to automatically generating failures of system components. They have created a powerful culture of stress-testing their systems and learning from any failures.