

TL;DR: we are now using the full Go Graphite stack.

If you want to know more about what we do at Teads and who we are, you can have a look here.

This project started when we were using a single physical server to manage all our monitoring metrics on Graphite. At that time, our services were writing 5 million monitoring metrics to Graphite at a rate of 50,000 data points per second. This bare-metal server had 2 TB of SSD storage managed by a hardware RAID controller, 24 cores, and 32 GB of RAM.
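To put those 50,000 data points per second in perspective, each one reaches Graphite through carbon's plaintext protocol: a single `metric.path value timestamp` line, typically sent over TCP port 2003. Below is a minimal sketch in Go; the endpoint and metric name are placeholders, not our actual setup.

```go
// Minimal sketch of writing one data point to Graphite via the carbon
// plaintext protocol: "<metric.path> <value> <unix_timestamp>\n" on TCP 2003.
// The host and metric name below are placeholders.
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	conn, err := net.DialTimeout("tcp", "graphite.example.internal:2003", 5*time.Second)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// One line per data point; carbon parses it and stores the value
	// under the dotted metric path.
	_, err = fmt.Fprintf(conn, "app.eu-west-1.api.requests.count %f %d\n",
		42.0, time.Now().Unix())
	if err != nil {
		panic(err)
	}
}
```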

Those metrics come from thousands of AWS instances spread over 3 regions around the world. We mainly use Graphite to monitor our stateless autoscaling components, which represent a large number of hosts and metrics. But we also use Datadog for our backend systems and CloudWatch for AWS managed infrastructure. On the read side, we use Grafana to produce dashboards that are displayed in the office. These dashboards are key to detecting any regression during deployments and troubleshooting sessions.

This setup had several shortcomings. First, it is a single point of failure for all our services' monitoring. The rest of our infrastructure is built to be highly available, and we couldn't achieve that here. We were also reaching the limits of this hardware and had experienced several small outages. Moreover, we were anticipating a x3–4 increase in data points per second by the end of the year, and this setup would not scale.

We started to think about our options to get both high availability and scalability. We also wanted to move Graphite to AWS, where all our services are hosted.

Looking at possible candidates, BigGraphite quickly drew our attention. This open source project, led by Criteo, supports Cassandra as a back-end for both data points and metrics metadata, instead of Whisper, Graphite's classical storage. We were confident about trying BigGraphite as the team behind it is solid. Using Cassandra was also attractive because we already run large clusters and have all the necessary tooling (mostly Rundeck jobs and Chef cookbooks).

While we steered toward BigGraphite, we also assessed other options:

Create a cluster with vanilla Graphite: we quickly dismissed this option because of potential performance limitations and the burden of managing many processes (Python is single-threaded). We also didn't want to keep using Whisper files, as they would be more complex to scale compared to Cassandra.

Move to Prometheus: although this project looked very promising, we didn't want to break compatibility for our consumers. Scalability was also a concern, with the need to shard manually, split the system per region, and manage federation. Finally, data retention with Prometheus is short term, and we wanted to keep a longer retention period with downsampling.
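For context on that last point, Graphite handles retention and downsampling declaratively through its storage-schemas.conf and storage-aggregation.conf files. The sketch below uses illustrative values only; the patterns and retention windows are examples, not our actual policy.

```ini
# storage-schemas.conf -- per-metric retention, from fine to coarse resolution
# (example values, not our production policy)
[default]
pattern = .*
retentions = 10s:1d,1m:30d,1h:2y

# storage-aggregation.conf -- how points are rolled up when downsampling
[default]
pattern = .*
xFilesFactor = 0.5
aggregationMethod = average
```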
