xFilesFactor and Graphite Whisper


In recent months we started using Graphite to collect measurements about the performance of our system. Previously we used Munin and some custom solutions to collect and store data, but Graphite is so powerful and easy to use (especially when coupled with StatsD) that we now use it for all our measuring needs. But as with any new piece of technology that you introduce into your system, Graphite has its idiosyncrasies, which you discover only after using it for some time.

Thus, I spent all of yesterday afternoon and evening trying to discover why my nice Graphite chart displaying index statistics for the past month abruptly ended 7 days ago. And when I zoomed the chart out to see the statistics for the past year, the chart abruptly ended 30 days ago. Given these symptoms, I immediately knew that the problem was somehow connected with the way Graphite stores and aggregates data.

Graphite stores data in a very interesting way. Instead of using a dynamic data structure that could accommodate a varying amount of data, Graphite stores time series data in a static, fixed-size data structure, which greatly improves performance. A naive implementation of a static data structure would require sacrificing either sampling rate or lots of storage. Graphite solves this problem by acknowledging that you usually need detailed measurements only for the recent past, while aggregate values suffice for earlier times. The data structure used by Graphite is therefore an array of arrays, each storing data at a different granularity and with a different retention period. For example, we typically use the following retention settings:

retentions = 1s:24h, 10s:7d, 10m:30d, 1h:5y

This setting indicates that we store one datum per second for the past 24 hours, one datum per 10 seconds for the past 7 days, one datum per 10 minutes for the past 30 days, and one datum per hour for the past 5 years. These settings translate into 86400, 60480, 4320, and 43800 data points, respectively, which together take about 2.3 MB of storage.
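The arithmetic above can be checked with a small sketch. The helper names are my own, not part of Whisper, and the byte sizes (a 16-byte file header, 12 bytes of metadata per archive, 12 bytes per data point) reflect my reading of Whisper's on-disk layout, so treat them as assumptions rather than a specification.

```python
# Assumed sizes, based on Whisper's struct formats "!2LfL", "!3L" and "!Ld".
METADATA_SIZE = 16      # file header
ARCHIVE_INFO_SIZE = 12  # one entry per archive
POINT_SIZE = 12         # 4-byte timestamp + 8-byte double

# A year is counted as 365 days, which matches the 43800 hourly points above.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "y": 86400 * 365}


def parse_duration(text):
    """Convert a string such as '10s' or '5y' into seconds."""
    return int(text[:-1]) * UNIT_SECONDS[text[-1]]


def parse_retentions(definition):
    """Return a list of (seconds_per_point, number_of_points) tuples."""
    archives = []
    for part in definition.split(","):
        precision, retention = part.strip().split(":")
        step = parse_duration(precision)
        archives.append((step, parse_duration(retention) // step))
    return archives


def whisper_file_size(archives):
    """Estimate the size in bytes of a file holding the given archives."""
    header = METADATA_SIZE + ARCHIVE_INFO_SIZE * len(archives)
    return header + POINT_SIZE * sum(points for _, points in archives)


archives = parse_retentions("1s:24h, 10s:7d, 10m:30d, 1h:5y")
print(archives)                     # [(1, 86400), (10, 60480), (600, 4320), (3600, 43800)]
print(whisper_file_size(archives))  # 2340064
```

The (seconds per point, number of points) pairs are the same numbers that whisper-resize.py accepts in its precision:points form.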

When using Graphite, one should really understand the storage policy and how data is aggregated when it moves from one retention level to the next. To solve my problem, I dug deep into the code of Whisper (Graphite's specialized database system) and analyzed how the data is actually stored and aggregated. In the end, I learned about the xFilesFactor setting, which specifies "what fraction of the previous retention level’s slots must have non-null values in order to aggregate to a non-null value." I had been storing very sparse data in Graphite, and with the default xFilesFactor setting of 0.5 it got aggregated to null values. Fortunately, Graphite also has very good management tools, so issuing

python whisper-resize.py /path/to/metric.wsp 1:86400 10:60480 600:4320 3600:43800 --xFilesFactor=0.0

resolved my problems and my chart now looks awesome at all time resolutions.
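
To make the effect of xFilesFactor concrete, here is a minimal sketch of the aggregation rule as I understand it from reading the Whisper code. It is a simplified model, not Whisper's actual implementation: the function name, the sample values, and the choice of averaging as the aggregation method are assumptions made for illustration.

```python
def aggregate(slots, x_files_factor):
    """Collapse higher-resolution slots into one lower-resolution value.

    Returns None when too few slots hold a value, mimicking the rule that
    the fraction of non-null slots must reach x_files_factor.
    """
    known = [value for value in slots if value is not None]
    if not known:
        # Nothing to aggregate, whatever the factor is.
        return None
    if len(known) / len(slots) >= x_files_factor:
        return sum(known) / len(known)  # averaging is assumed here
    return None


# Sixty 10-second slots feed one 10-minute slot; only three hold a value.
sparse = [None] * 60
sparse[0], sparse[20], sparse[40] = 4.0, 6.0, 8.0

print(aggregate(sparse, 0.5))  # None: 3 of 60 slots is below one half
print(aggregate(sparse, 0.0))  # 6.0: any known value is enough
```

With the factor at 0.5 my sparse measurements vanished as soon as they left the finest archive, which is exactly why the chart stopped at the boundary between retention levels.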