TICK tack TIG - telegraf, InfluxDB and Grafana - A journey

Posted on Sat 05 May 2018 in monitoring

For years, collectd combined with monit was my preferred monitoring stack for my private systems: this webserver, my mailserver, a Raspberry Pi and so on.

That setup had its issues. monit was quite noisy in some situations: my Raspberry Pi sometimes sent a few dozen mails because ssh had been unresponsive for a few minutes. Tweaking the settings consumed quite some time for basically no useful output. All services mostly ran as expected, and even if one failed it wasn't a big deal.

With collectd the problem was a different one. It stores data mainly in RRD files, which keep less detail the further back you go, so past trends couldn't be observed. Combining it with another data store didn't seem like the right way to solve this either.

So a few months ago, when I had a bit of time in February, I evaluated different solutions. In the end I decided to start with the TICK stack by InfluxData. It's open source, and since InfluxData has to keep developing the stack for their own services, the release notes showed many great new features. It sounded like the future.

TICK stack

This stack consists of four components:

  • telegraf - monitoring agent running on a server and gathering data, e.g. system load and memory usage
  • InfluxDB - time-series database basically built for the stack. With simple SQL-like queries!
  • Chronograf - visualization web-frontend with administrative functions
  • Kapacitor - alerting manager with quite some add-ons
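
To give an idea of what "SQL-like" means, here is a sketch of an InfluxQL query run through the influx command line client. It assumes telegraf's default cpu input (measurement cpu, field usage_idle), a database called telegraf and an InfluxDB listening locally on its default port. The function name run_query is made up for this example.

```shell
# Average idle CPU per host over the last hour, in 5 minute buckets.
# Assumption: telegraf's cpu input writes into a database named "telegraf".
QUERY="SELECT mean(\"usage_idle\") FROM \"cpu\" WHERE \"cpu\" = 'cpu-total' AND time > now() - 1h GROUP BY time(5m), \"host\""

# Hypothetical helper: run the query with the influx CLI against localhost.
run_query() {
    influx -database telegraf -execute "$QUERY"
}
```

Calling run_query on the InfluxDB host should print one series per host tag.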

For a basic setup I can recommend the DigitalOcean tutorial for CentOS 7.

After setting up the stack and monitoring one system, Chronograf revealed an issue, one I couldn't solve on my own: it lacks an active community sharing dashboards. Chronograf comes with some dashboards for telegraf input plugins. These should be added automatically, which only worked sometimes. Building your own dashboards is quite some work, work I didn't want to do for nearly everything. It's still my spare time, and without an easy way to share dashboards centrally and give something back to the community, Chronograf didn't feel right.

So where to move next?

TIG stack

Grafana has been around for years when it comes to visualising time-series data. It's mentioned in nearly every article nowadays and I believe everyone has seen it before: ever been at Congress and used the Dashboard?

With Grafana it was really easy to explore Dashboards. I'll share some of my favourites later.

Making it work in real life

The setup guide above is, as always, only a start. It doesn't use SSL and doesn't reflect my setup in many respects.

Base Setup

  • Fedora 27 LXC container on a physical server (let's call it lxc-tig) - running telegraf, InfluxDB and Grafana
  • Internal LXC interface only
  • Host system with IPv4 and IPv6 addresses
  • LetsEncrypt SSL certificates
  • firewall-cmd on host system

SSL termination

InfluxDB

InfluxDB must be reachable by all clients and preferably encrypted. This led to various issues in my setup:

  • SSL certificates are generated on the host system
  • Generating a LetsEncrypt certificate on lxc-tig would require forwarding port 80/tcp, which is being used by this webserver
  • Copying certs to the lxc container is a sloppy workaround
  • Control over protocols and ciphers is not possible on InfluxDB

Although InfluxDB's own SSL setup is easy, the points above kept me from terminating SSL at InfluxDB.

After thinking about alternatives and failing because of technical restrictions, I decided to use haproxy for SSL termination and to forward the traffic to lxc-tig and therefore to InfluxDB.

Keep in mind: Forwarding traffic to InfluxDB in an existing vhost with a context path won't work.

haproxy

haproxy is simple, fast and easy to set up... uhm. No. Actually that's not true. My haproxy configuration (/etc/haproxy/haproxy.cfg) looks like this:

global
    log         127.0.0.1 local2

    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     50
    user        haproxy
    group       haproxy
    daemon

    stats socket /var/lib/haproxy/stats

    ssl-default-bind-ciphers ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-SHA384
    ssl-default-bind-options no-sslv3 no-tlsv10 no-tlsv11 no-tls-tickets
    ssl-default-server-ciphers ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-SHA384
    ssl-default-server-options no-sslv3 no-tlsv10 no-tlsv11 no-tls-tickets
    tune.ssl.default-dh-param 4096

defaults
    log                     global
    option                  dontlognull
    option                  redispatch
    retries                 3
    timeout http-request    10s
    timeout queue           1m
    timeout connect         10s
    timeout client          1m
    timeout server          1m
    timeout http-keep-alive 10s
    timeout check           10s
    maxconn                 3000

# influxdb on lxc-tig
frontend influxdb
    bind *:8086 ssl crt /etc/letsencrypt/certs/monitoring.erinnerungsfragmente.de/fullchainprivkey.pem
    mode http
    http-response set-header Strict-Transport-Security max-age=15768000
    default_backend         influxdb

backend influxdb
    mode                    http
    server                  upstream lxc-tig:8086

Let's quickly walk through it:

  • global section - defines some base settings, including the SSL settings. There's no need to change those. The ciphers and options above are my settings and should lead to a great result with testssl.sh
  • defaults section - applies to all front- and backends. There's no need to change those either
  • frontend influxdb - this frontend listens on port 8086/tcp and uses my LetsEncrypt certificate
  • backend influxdb - this one only defines the http connection to my InfluxDB on lxc-tig
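
Before reloading haproxy it helps to let it check the file first. A minimal sketch, assuming haproxy is managed by systemd and firewalld guards the host; both function names are only for illustration.

```shell
# Validate the configuration, then reload haproxy only if the check passed.
apply_haproxy_config() {
    # -c parses the file and reports errors without starting the proxy
    haproxy -c -f /etc/haproxy/haproxy.cfg || return 1
    systemctl reload haproxy
}

# Open the InfluxDB port on the host firewall (permanent rule, then reload).
open_influxdb_port() {
    firewall-cmd --permanent --add-port=8086/tcp && firewall-cmd --reload
}
```

Both functions need root privileges. Nothing happens until they are called.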

haproxy expects the certificate chain and the private key in a single file (the fullchainprivkey.pem above). To create that file I've written a small dehydrated hook and published it on GitHub. In your dehydrated config you can call the script with the HOOK option.

You may want to add a haproxy service restart to your hook.sh.
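
The following is not the published script, just a minimal sketch of what such a hook can look like. dehydrated calls the hook with the operation name as first argument; for deploy_cert the remaining arguments are the domain and the paths to the key, cert, fullchain and chain files. The output name fullchainprivkey.pem matches the haproxy config above, and the reload at the end assumes systemd.

```shell
#!/usr/bin/env bash
# Sketch of a dehydrated hook that builds the combined PEM file for haproxy.

deploy_cert() {
    # Arguments as passed by dehydrated:
    # DOMAIN KEYFILE CERTFILE FULLCHAINFILE CHAINFILE TIMESTAMP
    keyfile="$2"
    fullchainfile="$4"
    target="$(dirname "$fullchainfile")/fullchainprivkey.pem"
    # haproxy wants chain and private key in one file;
    # a newly created file is readable by its owner only
    ( umask 077; cat "$fullchainfile" "$keyfile" > "$target" )
}

# Other operations (deploy_challenge, clean_challenge, ...) are ignored here.
case "${1:-}" in
    deploy_cert)
        shift
        deploy_cert "$@" && systemctl reload haproxy
        ;;
esac
```

Writing the combined file next to fullchain.pem keeps everything inside dehydrated's certs directory, so the path in the haproxy bind line doesn't change on renewal.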

Grafana

Grafana will listen on port 3000/tcp by default. nginx can be used to forward traffic. Within my vhost I've added the following:

    location / {
        proxy_ignore_client_abort on;
        proxy_pass http://lxc-tig:3000;
        proxy_set_header Host $host;
    }

Inside /etc/grafana/grafana.ini on lxc-tig I've changed the server section as follows:

[server]
protocol = http
http_port = 3000
domain = monitoring.erinnerungsfragmente.de
;enforce_domain = true
root_url = https://%(domain)s/
router_logging = false

That's it. (Pretty simple compared to InfluxDB, isn't it?)

Grafana configuration

After setting up Grafana a datasource definition is required. This is how Grafana will read data from InfluxDB. As both Grafana and InfluxDB are running inside the same LXC container my connection config is simple:

[Screenshot: grafana influxdb connection]
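
The same kind of datasource can also be created through Grafana's HTTP API instead of the web interface. This is a sketch only: the datasource name, the admin:admin credentials (Grafana's default after installation) and the localhost URLs are assumptions and not copied from my setup.

```shell
# JSON body for Grafana's POST /api/datasources endpoint (values are assumptions).
datasource_json() {
    cat <<'EOF'
{
  "name": "InfluxDB",
  "type": "influxdb",
  "access": "proxy",
  "url": "http://localhost:8086",
  "database": "telegraf"
}
EOF
}

# Hypothetical helper: send the definition to a Grafana running on localhost.
create_datasource() {
    datasource_json | curl -s -X POST \
        -H 'Content-Type: application/json' \
        -u admin:admin \
        --data-binary @- \
        http://localhost:3000/api/datasources
}
```

If InfluxDB requires authentication, the credentials have to be added to the definition as well.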

telegraf configuration

All my telegraf clients are using the haproxy connection which is forwarding traffic to my InfluxDB. As I'm using a valid certificate the endpoint configuration inside my /etc/telegraf/telegraf.conf on all clients doesn't require many arguments:

[[outputs.influxdb]]
  urls = ["https://monitoring.erinnerungsfragmente.de:8086"] # required
  database = "telegraf"
  retention_policy = ""
  write_consistency = "any"
  timeout = "5s"
  username = "someuser"
  password = "supersecretpassword"
  user_agent = "telegraf"
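
To check from a client that the endpoint is reachable through haproxy, InfluxDB's /ping endpoint is handy: it answers with HTTP 204 when the database is up. A small sketch with made-up function names; the URL is the one from the config above.

```shell
INFLUX_URL="https://monitoring.erinnerungsfragmente.de:8086"

# Succeeds if InfluxDB answers GET /ping with HTTP 204.
check_influxdb() {
    status=$(curl -s -o /dev/null -w '%{http_code}' "$INFLUX_URL/ping")
    [ "$status" = "204" ]
}

# Collect metrics once and print them to stdout; nothing is sent to the outputs.
dry_run_telegraf() {
    telegraf --config /etc/telegraf/telegraf.conf --test
}
```

Note that the telegraf test run only validates the config and the inputs. Whether writing to InfluxDB works shows up in the telegraf log after a normal start.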
 

What's next

Hopefully I'll have some time to share my dashboards and additional configurations. :)