Jul 7, 2019

Introduction to Glaber

Prologue

Glaber is an open-source, scalable, high-performance NMS based on Zabbix, created with modern realities in mind.
Since the Zabbix team is busy creating experts on changing colors in CSS files and has no time to accept a patch for Clickhouse support, Mikhail Makurov and his team created a fork. It grew out of many patches that introduced new features such as:

Clickhouse as primary storage
High-Availability clusters
Asynchronous pollers (including agents)
Nmap support for simple checks
Domain-based naming for hosts

Clickhouse

Clickhouse is an open-source distributed column-oriented DBMS.
It was created by Yandex for their analytics service and has been adopted by hundreds of companies with positive feedback.
Glaber uses Clickhouse as storage for history and trends.
Thanks to that decision you get much higher speed (up to 100 times), much smaller disk usage for the same amount of metrics (up to 50 times), and lower CPU usage overall.
You can estimate the required disk size for your system using the following figures:

1 billion metrics without nanoseconds require 3.5-5 GB.
Trends are calculated by Clickhouse automatically every hour, so allow an extra 10-15% of the history size.

As you can see, in the worst case the database size will be ~6 GB for 1B metrics (absolutely amazing, right?). Compare it with what you currently have.
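The estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function name and parameters are made up for illustration, and the figures are the ones quoted above (3.5-5 GB per billion metrics, plus 10-15% for trends).

```python
def estimate_storage_gb(metrics, gb_per_billion, trends_overhead):
    """Estimate Clickhouse disk usage in GB for history plus trends.

    metrics         -- number of stored metric values
    gb_per_billion  -- history size per 1 billion values (3.5-5 GB per the text)
    trends_overhead -- extra share for trends (0.10-0.15 per the text)
    """
    history_gb = metrics / 1e9 * gb_per_billion
    return history_gb * (1 + trends_overhead)


# Best and worst case for 1 billion metrics.
best = estimate_storage_gb(1_000_000_000, 3.5, 0.10)
worst = estimate_storage_gb(1_000_000_000, 5.0, 0.15)
print(f"{best:.2f} GB to {worst:.2f} GB")
```

The worst case comes out at about 5.75 GB, which is where the "~6 GB" figure comes from.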

High-Availability

Glaber adds support for high availability: you can build a cluster of up to 64 servers (later we’ll raise that limit to 256 or even more), which gives you stability together with unbelievable performance.
When a cluster is created, each server takes its share of the hosts and polls only those hosts until the topology changes.
If one server dies and stops responding to a simple check, the other servers divide its hosts among themselves and keep polling.
And most importantly, since each server can handle up to 50-60K NVPS, you can easily reach roughly 3 million NVPS in the current version and up to 15 million NVPS in the near future.
NOTE: this is still an experimental feature. Please don’t use it in production yet.
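To make the idea of splitting hosts and redistributing them on failure concrete, here is a toy sketch. It is not Glaber's actual coordination algorithm: the modulo-based assignment, the function name and the server names are all assumptions chosen only for illustration.

```python
def assign_hosts(host_ids, live_servers):
    """Split hosts across live servers; each server polls only its share.

    Hypothetical scheme: a host goes to the server at index
    host_id % number_of_live_servers in the sorted server list.
    """
    servers = sorted(live_servers)
    assignment = {server: [] for server in servers}
    for host_id in host_ids:
        owner = servers[host_id % len(servers)]
        assignment[owner].append(host_id)
    return assignment


hosts = list(range(1, 13))

# Normal operation: three servers share twelve hosts.
before = assign_hosts(hosts, ["srv1", "srv2", "srv3"])

# srv2 stopped responding: the topology changed, so the
# remaining servers divide all hosts between themselves.
after = assign_hosts(hosts, ["srv1", "srv3"])

for server, owned in after.items():
    print(server, owned)
```

A plain modulo scheme like this moves many hosts whenever the topology changes; a real cluster would likely want something that reshuffles less, but the sketch is enough to show that every host stays polled after a server is lost.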

Asynchronous pollers

One of the primary features is asynchronous polling for SNMP and agents.
Each async poller can handle up to 800 metrics/second.
This is possible because polling tasks are processed in batches.
In one iteration a poller thread takes 4-8 thousand tasks instead of one.
To estimate the number of threads you need, assume that such pollers are about 150 times faster than regular ones.
For example, if the server used 200 threads and each of two proxies used 200 threads, that is 600 in total, so the server will need 4 asynchronous threads (600 / 150).
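The same estimate as a small Python sketch. The function name is hypothetical, and the 150x factor is just the rough figure from the text, not a measured constant.

```python
import math


def async_threads_needed(regular_threads, speedup=150):
    """Estimate how many async poller threads replace regular ones.

    speedup is the rough "about 150 times faster" figure from the text.
    At least one thread is always returned.
    """
    return max(1, math.ceil(regular_threads / speedup))


server_threads = 200   # regular poller threads on the server
proxy_threads = 200    # regular poller threads on each proxy
proxies = 2

total = server_threads + proxy_threads * proxies
needed = async_threads_needed(total)
print(f"{total} regular threads -> {needed} async threads")
```

Rounding up keeps the estimate on the safe side when the total is not an exact multiple of 150.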

Nmap

By default, Zabbix uses fping for simple checks.
Because fping is slow, Glaber added support for nmap.
Nmap lets you “ping” hosts 40 times faster, but it doesn’t report the percentage of packet loss.
Think twice before using it in production if you have a bunch of hosts with significant packet loss.
So this is an experimental feature, and later we’ll probably move to zmap.

Domain-based naming

If you have ever had host IDs change for some reason, this feature is for you.
An additional hostname column in the history table protects you from “history loss” caused by a changed ID.
Host IDs change quite rarely (for example, something happened to the DBMS and you had to re-upload all hosts through the API), but let’s be honest, it does happen.

Future plans

Add zmap support for simple checks.
Change the way HA is coordinated (we are thinking about a solution based on etcd).
Add the ability to write extensions in Python or Go.