Wednesday, September 05, 2007

It's All About the Traceability

I just came across a very interesting blog post, "Log Everything All the Time", about logging information in a production environment.

As a software developer, I just want to further elaborate on this:

1. It is all about the Traceability!

  In a real production environment, something is always failing somewhere: a router or switch can die or be rebooted, connections time out when the firewall is jammed, SSL certificates expire, and there are DB upgrades, OS patching, and so on. The failure may strike just as a new merchant customer sends out his second credit authorization request, or just as the approval of a $10,000 financial transaction arrives. In the end, you will grab all the forensic information you can find to diagnose it.

2. All this logging is for neither the Operations Team nor the Support Engineers; it is actually for the developers themselves.

  To borrow a line our Infrastructure Architect often says: "If there is anything you wish to see when you get called at 2AM, you should log it for yourself."

3. Strong Infrastructure Support

  Obviously, to make "Log Everything All the Time" work, you need a very powerful Logging Server/Bus to consume all the generated log info. You also need a powerful Log Miner or Search Tool to help you query, filter and correlate. Fortunately, in my current working environment, we have this kind of infrastructure available, so we have no excuse not to write logging code.

  Given a production environment with a firewall, load balancer, front service gateway, Application Server, and backend DB, if it is not feasible to really log everything, you should at least log every request and response going in and out of each system/component and every critical state change in each system/component, and, where possible, every other state change as well. Again, it is all about the traceability.
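  The minimum described above could be sketched roughly as follows, in Python with the standard logging module. This is only an illustration, not code from our system: the helper names (make_logger, handle_with_trace, log_state_change), the component name, and the idea of stamping every line with a trace ID so the Log Miner can correlate entries across components are all my own assumptions.

```python
import io
import logging
import uuid


def make_logger(stream):
    """Build a logger that writes timestamped lines to the given stream (hypothetical helper)."""
    logger = logging.getLogger("traceability_demo")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers if called twice
    logger.propagate = False  # keep output on our handler only
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def handle_with_trace(logger, component, request, handler, trace_id=None):
    """Log the request going in and the response (or failure) coming out of a component."""
    # Assumption: a trace ID is generated at the edge and passed along to later components.
    trace_id = trace_id or uuid.uuid4().hex
    logger.info("trace=%s component=%s event=request payload=%r", trace_id, component, request)
    try:
        response = handler(request)
    except Exception:
        # logger.exception records the stack trace as well, which is what you want at 2AM.
        logger.exception("trace=%s component=%s event=failure", trace_id, component)
        raise
    logger.info("trace=%s component=%s event=response payload=%r", trace_id, component, response)
    return response


def log_state_change(logger, component, trace_id, old_state, new_state):
    """Log a state transition inside a component, tagged with the same trace ID."""
    logger.info(
        "trace=%s component=%s event=state_change from=%s to=%s",
        trace_id, component, old_state, new_state,
    )
```

  Calling handle_with_trace(log, "gateway", request, authorize, trace_id="t1") and then log_state_change(log, "gateway", "t1", "PENDING", "APPROVED") would leave three lines that all carry trace=t1, so a search on that one value pulls up the whole story of the request. In a real payment system you would of course have to mask card numbers and similar fields before they reach the log.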