Currently, I’m working for an ISP in Ukraine with thousands of network devices that need to be monitored without much effort.
Our stack includes Zabbix, Cacti and other traditional NMSs, but none of them works well once the number of metrics becomes big enough. That’s the problem we are trying to solve.
As you may already know, Zabbix uses MySQL or PostgreSQL as its default database to store history and trends data.
That’s okay as long as the number of events per second stays below a few hundred and you don’t care about analytics.
In our case, we’ve been forced to poll only once every 5 minutes and to keep historical data for one month at most.
That’s an extremely short retention period, and as a result we can’t provide good enough service for our customers at the access level. So I started looking for a solution.
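To get a feel for why the poll interval and retention matter so much, here is a rough back-of-the-envelope sketch. The item count is a hypothetical number chosen for illustration, not our real figure, and the function names are mine; the only inputs taken from the text above are the 5-minute interval and the one-month retention.

```python
def required_nvps(item_count: int, poll_interval_s: int) -> float:
    """Average new values per second for item_count items polled every poll_interval_s seconds."""
    if poll_interval_s <= 0:
        raise ValueError("poll_interval_s must be positive")
    return item_count / poll_interval_s


def stored_rows(item_count: int, poll_interval_s: int, retention_days: int) -> int:
    """Approximate number of history rows kept for the whole retention period."""
    if poll_interval_s <= 0:
        raise ValueError("poll_interval_s must be positive")
    polls_per_day = 86_400 // poll_interval_s
    return item_count * polls_per_day * retention_days


if __name__ == "__main__":
    ITEMS = 150_000  # hypothetical item count, for illustration only
    for interval in (300, 60):
        print(
            f"interval={interval}s "
            f"nvps={required_nvps(ITEMS, interval):.0f} "
            f"rows_30d={stored_rows(ITEMS, interval, 30):,}"
        )
```

Under these assumed numbers, going from a 5-minute to a 1-minute interval multiplies both the write rate and the stored row count by five, which is exactly the kind of growth an RDBMS-backed history table struggles with.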
Through hardships to the stars
Officially, Zabbix supports MySQL, PostgreSQL, IBM DB2 and Oracle DB.
None of those RDBMSs was created to store time-series data.
We tried switching from MySQL to PostgreSQL, but the result was almost the same (it also suffers as the metric count grows). Even if you use SSDs in RAID, a 3-node cluster configuration or upgraded server hardware, nothing changes significantly…
I had some previous experience with InfluxDB and ClickHouse and thought these TSDBs could solve our problem easily, but there is no official wrapper for them yet.
Searching for a solution to our problem was trivial, but almost all recommendations were based on tuning the RDBMS and splitting Zabbix up, which only complicates maintenance.
While looking for a ClickHouse wrapper, I found an article by Mikhail Makurov.
His team had similar problems, so they decided to create a patch that stores historical data in ClickHouse.
Inspired by the benchmarks of their fork (it easily handles up to 50k NVPS), I decided to give it a shot.
Before: a distributed Zabbix with two instances and 5 proxies, up to 200 GB for 1.5 billion metrics stored in MySQL, and slowness everywhere.
After: one Zabbix server which consumes up to 8 GB of RAM together with ClickHouse and the frontend. The database size shrank by up to 50 times and fetching speed increased by up to tens of times!
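To put those storage numbers in perspective, here is a small arithmetic sketch. It takes the upper bounds quoted above (200 GB, 1.5 billion metrics, a 50x reduction) at face value and assumes 1 GB = 10^9 bytes, so treat the results as rough orders of magnitude rather than measurements.

```python
GB = 10**9  # assumption: decimal gigabytes


def bytes_per_value(total_bytes: float, value_count: int) -> float:
    """Average on-disk footprint of one stored metric value."""
    return total_bytes / value_count


def size_after_reduction(total_bytes: float, factor: float) -> float:
    """Database size after shrinking it by the given factor."""
    return total_bytes / factor


if __name__ == "__main__":
    mysql_bytes = 200 * GB    # upper bound quoted in the text
    values = 1_500_000_000    # 1.5 billion stored metrics
    reduced = size_after_reduction(mysql_bytes, 50)

    print(f"MySQL:      {bytes_per_value(mysql_bytes, values):.1f} bytes per value")
    print(f"ClickHouse: {reduced / GB:.1f} GB total, "
          f"{bytes_per_value(reduced, values):.1f} bytes per value")
```

In other words, the same history goes from roughly a hundred-plus bytes per value (data plus indexes) to only a few bytes per value in the best case.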
Plus, as a great bonus, we are using asynchronous pollers (each of them handles ~600-800 metrics/sec) and some other tweaks.
Will we use it in production? Absolutely yes!
Would I recommend starting to use Zabbix for a new installation just because of that patch? Probably not.
There are a lot of modern solutions like Prometheus (pull-based) and Riemann (push-based) which are more flexible, but that is a topic for another article.
Pros
- Reduced database size and improved fetching speed.
- Async pollers (they’re awesome!).
- Nmap support instead of fping.
- No more proxy needed.
- Escaped IOPS hell.
Cons
- RAM leaks when storing text data in ClickHouse (in our case ~200 MB/day), but Mikhail is working on that bug. :)
- Nanoseconds are still saved. We just dropped that column and cut the database size by ~40%.
- No official support at the moment.
- No trends support for ClickHouse yet… It’s not a big deal, since ClickHouse retrieves millions of metrics in less than a second.
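For context on the missing trends: in stock Zabbix, trends are hourly aggregates (count, min, avg, max) of the raw history. When the history store is fast enough, the same aggregates can be computed on the fly from raw values. The sketch below only illustrates that idea in plain Python; the function name and the sample data are hypothetical, and in practice this would be an aggregate query run inside ClickHouse rather than application code.

```python
from collections import defaultdict


def hourly_trends(samples):
    """Aggregate (unix_timestamp, value) pairs into hourly buckets.

    Returns {hour_start: (count, min, avg, max)}, the same shape of data
    that a trends table would hold for one item.
    """
    buckets = defaultdict(list)
    for clock, value in samples:
        hour_start = clock - clock % 3600  # truncate to the start of the hour
        buckets[hour_start].append(value)
    return {
        hour: (len(vals), min(vals), sum(vals) / len(vals), max(vals))
        for hour, vals in sorted(buckets.items())
    }


if __name__ == "__main__":
    # Hypothetical raw history for a single item.
    history = [(0, 1.0), (1800, 3.0), (3600, 10.0)]
    for hour, stats in hourly_trends(history).items():
        print(hour, stats)
```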