Developing a Service-Level Agreement (SLA)

Solution

Thursday, Feb. 20: The answer key.

Submission details

Due: Wednesday, February 19, 2014
Submit to CourSys: A PDF file giving your answers to each question. Show how they were derived.
Percentage of course grade: 4%

Part 1: SLA for Latency

This assignment asks you to work through the issues discussed in The Tail at Scale. We will use the structure of Assignment 2 as an example application.

The latency of a request in Assignment 2 is due to the latencies of each of its component parts: The server (which saves the original image in S3 and queues up resizes on the workers), the time it takes the SQS queue to deliver messages from the server to the workers, and the workers (which do one or more resizes and save them in S3). Assume the following characteristics for each component:

Latency for 2 workers

Latency percentiles for w = 2 workers
Percentile    Time (ms)
50.0          125
99.9          175

Example: If two workers are creating thumbnails and every request requires each worker to make a thumbnail, then 50% of the time all thumbnails for one request will be completed within 125 ms of the workers receiving the request, and virtually all requests (99.9%) will have all their thumbnails ready within 175 ms of the workers receiving it.

Latency for 1000 workers

Latency percentiles for w = 1000 workers
Percentile    Time (ms)
50.0          150
90.0          325
99.0          650
99.9          1050
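
The longer tail in the 1000-worker table reflects the fan-out effect discussed in The Tail at Scale: a request that must wait for every worker is slow whenever any single worker is slow. The following is a minimal sketch of that relationship, assuming workers are slow independently of one another. The function name `p_any_slow` and the sample probability are illustrative and are not part of the assignment data.

```python
def p_any_slow(p_slow: float, workers: int) -> float:
    """Probability that at least one of `workers` independent workers is slow."""
    return 1.0 - (1.0 - p_slow) ** workers


# Illustrative value only: each worker is slow 0.1% of the time.
for w in (2, 1000):
    print(f"w={w}: {p_any_slow(0.001, w):.3f}")
```

Under this independence assumption, a per-worker slowdown that is rare for two workers affects roughly 63% of requests once 1000 workers must all reply.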

Question 1

Assume the latency for a complete request is the sum of the server, queue, and worker latencies. This is the time between the receipt of a request from a user task and all the thumbnails being stored in S3 and available to be read.

Using the above formulas, calculate the 99.9th percentile latency for requests when there are two workers.

Question 2

Using the same assumptions as above, but now assuming 1000 workers and that every request will require creation of 1000 thumbnails, one for each worker, calculate the 99.9th percentile latency.

Question 3

Added Mon, Feb 17: For this question, continue the assumption that you have 1000 workers, as done in Question 2.

Now assume that you have revised your project to use a hedged request algorithm. At the 99th percentile time, for every worker that has still not replied, you start a second worker with the same request. Assume you have a pool of idle workers from which to assign the duplicate request. Assume that a duplicated request has the latency distribution of 2 workers. (This is unrealistically simple but it makes the computation easier.)

What is the 99.9th percentile of this latency distribution?
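
The hedged request algorithm described above can be sketched as follows. This is a minimal illustration using Python's `asyncio`, not the Assignment 2 code: `hedged_call`, `fake_worker`, and the delays are hypothetical names and values chosen for the example. In the assignment's setting, the hedge delay would be the 99th percentile time.

```python
import asyncio
from typing import Any, Callable, Coroutine


async def hedged_call(
    make_request: Callable[[str], Coroutine[Any, Any, str]],
    hedge_delay_s: float,
) -> str:
    """Send a request; if no reply arrives within hedge_delay_s, send a
    duplicate to a second worker and return whichever reply comes first."""
    primary = asyncio.create_task(make_request("primary"))
    done, _ = await asyncio.wait({primary}, timeout=hedge_delay_s)
    if done:
        return primary.result()  # replied before the hedge delay expired

    # The primary is a straggler: start the duplicate on an idle worker.
    backup = asyncio.create_task(make_request("backup"))
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # the slower copy is no longer needed
    return next(iter(done)).result()


async def fake_worker(label: str) -> str:
    # Hypothetical stand-in for a resize worker: the primary is slow.
    await asyncio.sleep(0.5 if label == "primary" else 0.01)
    return label


if __name__ == "__main__":
    print(asyncio.run(hedged_call(fake_worker, hedge_delay_s=0.05)))
```

Only requests that are still outstanding at the hedge delay generate a duplicate, so the extra load stays small while the slowest replies are bounded by the hedge delay plus the duplicate's latency.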

Part 2: SLA for Throughput

Added Mon, Feb 17: This question does not include any of the assumptions from Part 1. Answer it only using the assumptions given in this part. Each EC2 instance can only do one request at a time and every request requires only a single EC2 instance.

Assume that your latency computations make you comfortable setting an SLA of 99% at 1400 ms for all your requests. You have reserved four EC2 instances that will be used exclusively for this service (in other words, single-tenant). These instances are identical to the one used for the latency computation.

What is your SLA for throughput?
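
One way to set up the calculation is sketched below. It assumes each instance handles requests back to back and that a request occupies its instance for at most the SLA latency. The constant and function names are illustrative.

```python
INSTANCES = 4        # reserved single-tenant EC2 instances
SLA_LATENCY_S = 1.4  # 99% of requests complete within 1400 ms


def throughput_per_second(instances: int, latency_s: float) -> float:
    """Requests per second when each instance serves one request at a time."""
    return instances / latency_s


print(f"{throughput_per_second(INSTANCES, SLA_LATENCY_S):.2f} requests/s")
```

Because the latency bound holds for 99% of requests, a throughput figure derived from it carries the same 99% qualifier.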

Part 3: SLA for Availability

Assume that Amazon offers an availability of “four nines” (99.99%). In addition, your operations staff tell you that they will likely need to have the system down for one working day (eight hours) per year.

How many “nines” is your SLA for annual availability?
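
A sketch of one way to combine the two sources of downtime follows. It assumes a 365-day year (8760 hours) and treats Amazon outages and planned maintenance as independent, so their availabilities multiply. Both assumptions and all names in the code are illustrative choices, not given by the assignment.

```python
import math

HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, 8760 hours


def combined_availability(provider: float, planned_down_hours: float) -> float:
    """Availability when provider outages and planned downtime are independent."""
    planned = 1.0 - planned_down_hours / HOURS_PER_YEAR
    return provider * planned


def nines(availability: float) -> int:
    """Count of leading nines, e.g. 0.9995 -> 3."""
    # Rounding guards against floating-point error at exact boundaries.
    return math.floor(round(-math.log10(1.0 - availability), 9))


a = combined_availability(0.9999, 8)
print(f"availability = {a:.6f}, nines = {nines(a)}")
```

Under these assumptions the eight hours of planned downtime dominate: they cost about 0.09% of the year, nearly ten times the 0.01% allowed by the provider's four nines.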