Logging (Week 12, Monday, March 31)

Logging defined

Basic idea

logging.debug('This message should go to the log file')
logging.info('So should this')
logging.warning('And this, too')

2010-12-12 11:41:42,612 DEBUG:root:This message should go to the log file
2010-12-12 11:41:43,015 INFO:root:So should this
2010-12-12 11:42:35,756 WARNING:root:And this, too
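The calls above only produce timestamped lines like these once the logging module has been configured. A minimal sketch of one such configuration follows; the log file location and the use of `force=True` (Python 3.8+) are assumptions made for this example, not part of the original slide.

```python
import logging
import os
import tempfile

# Hypothetical log location, chosen only so the example runs anywhere.
log_path = os.path.join(tempfile.gettempdir(), "example.log")

logging.basicConfig(
    filename=log_path,
    filemode="w",                # start with an empty file on each run
    level=logging.DEBUG,         # record DEBUG and everything above it
    format="%(asctime)s %(levelname)s:%(name)s:%(message)s",
    force=True,                  # replace any handlers configured earlier
)

logging.debug('This message should go to the log file')
logging.info('So should this')
logging.warning('And this, too')
```

With `level=logging.INFO` instead, the first message would be dropped and only the other two would reach the file.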

When do you record it?
What information do you record?
Do you have levels (DEBUG, INFO, WARNING)?
When do you turn levels on and off?
How do you analyze the logs?
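As one possible answer to the question about turning levels on and off, Python's logging module lets each named logger carry its own level. The sketch below assumes two hypothetical module names, `app.resize` and `app.billing`, and switches DEBUG on for only one of them.

```python
import logging

logging.basicConfig(
    level=logging.INFO,          # default for every logger: INFO and above
    format="%(levelname)s:%(name)s:%(message)s",
    force=True,
)

# Hypothetical module-level loggers.
resize_log = logging.getLogger("app.resize")
billing_log = logging.getLogger("app.billing")

# Turn DEBUG on for one module while troubleshooting it.
resize_log.setLevel(logging.DEBUG)
resize_debug_on = resize_log.isEnabledFor(logging.DEBUG)
billing_debug_on = billing_log.isEnabledFor(logging.DEBUG)

resize_log.debug("cache miss for image")   # emitted
billing_log.debug("tax table loaded")      # suppressed: effective level is INFO

# Turn it off again: NOTSET makes the logger inherit the root level.
resize_log.setLevel(logging.NOTSET)
```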

When do you use it?

From the Python 3 Logging HOWTO (Basic Logging Tutorial)

When to use logging
Task you want to perform	The best tool
Display output for ordinary use of a command-line program	`print()`
Report events from normal operation	`logging.info()` or `logging.debug()`
Issue a warning regarding an event	`logging.warning()` if there is nothing the application can do
Report an error from a specific event	Raise an exception
Report suppression of an error in a long-running process	`logging.error()`, `logging.exception()`, or `logging.critical()`, as appropriate
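The last two rows of the table can be illustrated together. In this sketch, a function reports a bad input by raising an exception, and a long-running loop suppresses that error but records it with `logging.exception()`. The logger name `worker` and both function names are hypothetical.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(levelname)s:%(name)s:%(message)s",
    force=True,
)
log = logging.getLogger("worker")  # hypothetical logger name


def parse_size(text):
    """Report an error from a specific event by raising an exception."""
    width, height = text.split("x")   # raises ValueError if there is no "x"
    return int(width), int(height)    # raises ValueError on non-numeric parts


def handle_all(requests):
    """Long-running loop: suppress per-request errors, but log them."""
    results = []
    for raw in requests:
        try:
            results.append(parse_size(raw))
            log.info("parsed %r", raw)          # event from normal operation
        except ValueError:
            # The error is suppressed so the loop keeps running;
            # logging.exception logs at ERROR level and adds the traceback.
            log.exception("skipping malformed request %r", raw)
    return results


sizes = handle_all(["100x100", "oops", "640x480"])
```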

Importance of logging

From 20 Obstacles to Scalability, p. 58:

Number 7: Insufficient monitoring and metrics

“it should be so basic you cannot imagine working without it”

Number 10: Insufficient logging

“You may enable a lot more of it when you are troubleshooting and debugging, but on an ongoing basis you will need it for key essential services”

From On Designing and Deploying Internet-scale Services, pp. 231–232:

Log everything all the time

From Log Everything All the Time.

For highly-available applications

Log everything
Log all the time
Only have two levels, NORMAL and DEBUG
Turn on debugging per-module
Every event should include the id of the customer request that started it:
- Customer S. Lee requested a resize of image ‘my-vacation-july-24-444.jpg’ => Request QX3567187

2014-02-12 11:41:42,612 root:QX3567187:Resize from S. Lee of 'my-vacation-july-24-444.jpg' started
2014-02-12 11:41:43,015 root:QX3567187:Resize saved in S3 entry 'lee-mvj24-3617846.jpg'
2014-02-12 11:41:43,212 root:QX3567187:Resize sent to instance EC2-Q347HN for 100 by 100 resize
2014-02-12 11:42:35,756 root:QX3567187:Resize completed by EC2-Q347HN
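One way to stamp every event with the originating request id in Python is `logging.LoggerAdapter`, which attaches extra fields to each record. The sketch below is an assumption about how such lines could be produced, not the method used in the cited article; `logger_for` is a hypothetical helper, and the output goes to an in-memory buffer so the example is self-contained.

```python
import io
import logging

buf = io.StringIO()  # in-memory stream standing in for the real log
logging.basicConfig(
    stream=buf,
    level=logging.INFO,
    # request_id is not a standard field; the adapter below supplies it.
    format="%(asctime)s %(name)s:%(request_id)s:%(message)s",
    force=True,
)


def logger_for(request_id):
    """Hypothetical helper: a logger that tags every event with request_id."""
    return logging.LoggerAdapter(logging.getLogger(), {"request_id": request_id})


log = logger_for("QX3567187")
log.info("Resize from %s of %r started", "S. Lee", "my-vacation-july-24-444.jpg")
log.info("Resize completed by %s", "EC2-Q347HN")

print(buf.getvalue(), end="")
```

Because the id is fixed when the adapter is created, code handling the request never has to pass it to individual logging calls.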

Keeping it efficient

Set up fast queue between high-priority worker process and low-priority logging process

Logging process does slower formatting operations
Allocating/deallocating queue buffers must be fast
Logger pushes to permanent storage in the background
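The standard library offers a small-scale version of this design in `logging.handlers.QueueHandler` and `QueueListener`. Note the swap: the sketch below uses a background thread in the same process, not a separate low-priority logging process, and an in-memory buffer stands in for permanent storage. The logger name `worker` is hypothetical.

```python
import io
import logging
import logging.handlers
import queue

log_queue = queue.Queue()   # unbounded in-memory queue between the two sides
buf = io.StringIO()         # stands in for permanent storage

# Slow side: applies the timestamped layout and writes the line out.
slow_handler = logging.StreamHandler(buf)
slow_handler.setFormatter(logging.Formatter("%(asctime)s %(name)s:%(message)s"))

# The listener runs slow_handler on its own background thread.
listener = logging.handlers.QueueListener(log_queue, slow_handler)
listener.start()

# Fast side: the worker's handler only merges the message arguments and
# enqueues the record; layout and I/O happen on the listener thread.
worker_log = logging.getLogger("worker")
worker_log.setLevel(logging.INFO)
worker_log.propagate = False   # keep records away from root handlers
worker_log.addHandler(logging.handlers.QueueHandler(log_queue))

for i in range(3):
    worker_log.info("handled request %d", i)

listener.stop()   # processes everything still queued, then joins the thread
written = buf.getvalue().splitlines()
```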

Any object should be easily dumped to the log

logging.dump(myobj)
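Python's standard logging module has no `dump` function, so the call above describes a desired interface. A minimal sketch of a helper in that spirit follows; the `dump` function and the `ResizeJob` class are hypothetical names introduced for this example.

```python
import logging
import pprint

logging.basicConfig(
    level=logging.DEBUG,
    format="%(levelname)s:%(name)s:%(message)s",
    force=True,
)


def dump(obj, logger=None, level=logging.DEBUG):
    """Hypothetical helper: log an object's type name and its state."""
    logger = logger or logging.getLogger()
    try:
        state = vars(obj)        # attribute dict, for objects that have one
    except TypeError:
        state = obj              # no __dict__ (e.g. int): show the value itself
    text = "%s %s" % (type(obj).__name__, pprint.pformat(state))
    logger.log(level, "%s", text)
    return text


class ResizeJob:
    """Hypothetical object to dump."""

    def __init__(self, image, width, height):
        self.image = image
        self.width = width
        self.height = height


line = dump(ResizeJob("my-vacation-july-24-444.jpg", 100, 100))
```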

Analyzing the logs

Products such as Loggly integrate logs from multiple sources and analyze them.

Guide to reading for next class

Read the following two short sections from F1: A Distributed Database that Scales:

Section 1: Introduction (pp. 1068–1069, not including “2. Basic Architecture”).
Section 10: Latency and Throughput (p. 1078, not including “11. Related Work”).

Key points: Most of the paper is concerned with database topics that are outside the scope of this course. However, the two sections I selected respond to two themes of the course:

The relation between scalability, availability, and latency. The F1 team claims to have found a unique design point in that space.
Latency, replication, and distribution across data centres. The Paxos algorithm they mention is a quorum-based consensus algorithm: it can make progress only while a quorum of instances (typically a majority) is available.
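The availability side of that trade-off comes down to simple arithmetic. The sketch below assumes plain majority quorums, as in classic Paxos; it does not reproduce the F1 deployment, and the function names are hypothetical.

```python
def majority_quorum(n_replicas):
    """Smallest number of replicas that forms a majority."""
    return n_replicas // 2 + 1


def tolerated_failures(n_replicas):
    """How many replicas can be down while a majority is still reachable."""
    return n_replicas - majority_quorum(n_replicas)


def can_make_progress(n_replicas, n_available):
    """True when enough replicas are available to form a quorum."""
    return n_available >= majority_quorum(n_replicas)


for n in (3, 5):
    print(n, "replicas: quorum", majority_quorum(n),
          "tolerates", tolerated_failures(n), "failures")
```

Every write must reach a quorum, so when replicas sit in different data centres, the latency of a commit includes at least one round trip between data centres.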