For the last few days I've been debugging Zemanta's system a lot. Besides noticing how poor our logging practices really are, I've also noticed that our logs are full of warnings, errors, and exceptions that nobody cares about. We run a complex service composed of many different parts. Each of these parts keeps a separate log file that resides on a server somewhere. A comment by one of our Ops people is quite telling in describing the situation:
Oh, we have a log for this application.
I won't reveal which application this is, but it is one of our more important ones. It seems that in the case of logging, the saying "out of sight, out of mind" quite accurately characterizes the usual state of affairs.
I've been thinking for quite some time now about how we could improve our logging situation. One obvious way is to go over all the logs regularly and check them manually for anomalies. But if you prefer to spend your time on more creative endeavors, the log checking process should be automated, in either a push or a pull fashion. Unfortunately, log aggregation seems to be quite a neglected area. I only know of Facebook Scribe as a push log aggregator, and Splunk and Loggly as pull log aggregators, and I don't have any real-life experience with any of them. Therefore, I'd really like to hear your take on log aggregation, and whether you have some first-hand experience with it yourself.
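As a rough illustration of what an automated, pull-style check could look like, here is a minimal Python sketch that reads a set of log files and counts the lines mentioning a warning or error. It is not how Zemanta's system or any of the tools above work. The function names `summarize_levels` and `summarize_files` are made up for this example, and it assumes each log line carries a plain level keyword such as `WARNING` or `ERROR`.

```python
import re
from collections import Counter

# Assumed level keywords; real logs may use different names or formats.
LEVEL_PATTERN = re.compile(r"\b(WARNING|ERROR|CRITICAL)\b")


def summarize_levels(lines):
    """Count how many lines mention each assumed level keyword."""
    counts = Counter()
    for line in lines:
        match = LEVEL_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def summarize_files(paths):
    """Read each log file in turn and return {path: Counter} for files with hits."""
    report = {}
    for path in paths:
        # errors="replace" keeps the scan going if a log contains stray bytes.
        with open(path, encoding="utf-8", errors="replace") as handle:
            counts = summarize_levels(handle)
        if counts:
            report[path] = counts
    return report
```

Run from cron with a list of log paths, something like `summarize_files(["/path/to/app.log"])` (a hypothetical path) would at least bring the forgotten logs back into sight. A push-style setup would instead have each application send its entries to a central collector.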