Microsoft SQL Server and collectd - what the heck?

Posted on Sun 10 December 2017 in how-to

You're running a Linux server, a webserver, maybe an application server? Everything is being monitored with collectd and the world is fine and peaceful. But wait a second! What about the database server? Sure enough PostgreSQL is your friend most times. But there are some cases where SQL Server is required. Beside with Microsoft working on SQL Server for Linux the issue of monitoring it with the tools we know becomes more important.

Gathering information

collectd oes have native plugins to monitor PostgreSQL or MySQL. For everything else the DBI plugin comes in handy.

It depends on libdbi, a project which doesn't seem to be very active. Additionally freetds is required. It's the driver we'll be using to connect to the SQL Server.

Beside those dependencies I've found a nice repo by Microsoft with a collectd configuration to monitor the system. What the heck? I'm still confused by their engagement in the community.

Testing the connection

freetds seems to be a bit tricky. First the connection to the server should be tested. We do this by using the tsql tool:

TDSDUMP=/tmp/freetds.log tsql -H <my server> -p 1433 -U sa

Take care of the options! If you're using the -S argument the port -p will be ignored. In general it's a good idea to turn on the logging by defining TDSDUMP as it's quite noisy and helpful to find issues.

Configuring freetds

freetds comes with a config file that should be altered after finding a working configuration (e.g. port). A new Server with an alias can be put at the end of the file. We'll use the alias servername from now on.

# vi /etc/freetds.conf
[servername]
    host = <dnsname/ip>
    port = 1433
    tds version = 8.0

Configure collectd

The queries by Microsoft need two additional data types which will be put in a separate file:

# vi /etc/collectd/types.sqlserver.db
sql_server_perf_stats    perf_stats_value:GAUGE:0:U
sql_server_wait_stats    wait_time_ms:DERIVE:0:U, waiting_tasks_count:DERIVE:0:U, max_wait_time_ms:GAUGE:0:U

This file will be referenced within the actual config file:

# vi /etc/collectd/sqlserver.conf
TypesDB "/etc/collectd/types.sqlserver.db"

<Plugin dbi>
    <Query "perfstats">
    Statement "select replace(rtrim(counter_name),' ','_') as counter_name, replace(rtrim(instance_name),' ','_') as instance_name, cast(cntr_value AS float) as cntr_value from sys.dm_os_performance_counters where (counter_name in ('SQL Compilations/sec','SQL Re-Compilations/sec','User Connections','Batch Requests/sec','Logouts/sec','Logins/sec','Processes blocked','Latch Waits/sec','Full Scans/sec','Index Searches/sec','Page Splits/sec','Page Lookups/sec','Page Reads/sec','Page Writes/sec','Readahead Pages/sec','Lazy Writes/sec','Checkpoint Pages/sec','Database Cache Memory (KB)','Log Pool Memory (KB)','Optimizer Memory (KB)','SQL Cache Memory (KB)','Connection Memory (KB)','Lock Memory (KB)', 'Memory broker clerk size','Page life expectancy')) or (instance_name in ('_Total','Column store object pool') and counter_name in ('Transactions/sec','Write Transactions/sec','Log Flushes/sec','Log Flush Wait Time','Lock Timeouts/sec','Number of Deadlocks/sec','Lock Waits/sec','Latch Waits/sec','Memory broker clerk size','Log Bytes Flushed/sec','Bytes Sent to Replica/sec','Log Send Queue','Bytes Sent to Transport/sec','Sends to Replica/sec','Bytes Sent to Transport/sec','Sends to Transport/sec','Bytes Received from Replica/sec','Receives from Replica/sec','Flow Control Time (ms/sec)','Flow Control/sec','Resent Messages/sec','Redone Bytes/sec') or (object_name = 'SQLServer:Database Replica' and counter_name in ('Log Bytes Received/sec','Log Apply Pending Queue','Redone Bytes/sec','Recovery Queue','Log Apply Ready Queue') and instance_name = '_Total')) or (object_name = 'SQLServer:Database Replica' and counter_name in ('Transaction Delay'))"
    <Result>
            Type "sql_server_perf_stats"
            InstancesFrom "counter_name" "instance_name"
            ValuesFrom "cntr_value"
        </Result>
    </Query>

    <Query "waitstats">
        Statement "WITH WaitCategoryStats ( wait_category, wait_type, wait_time_ms, waiting_tasks_count, max_wait_time_ms) AS ( SELECT CASE WHEN wait_type LIKE 'LCK%' THEN 'LOCKS' WHEN wait_type LIKE 'PAGEIO%' THEN 'PAGE I/O LATCH' WHEN wait_type LIKE 'PAGELATCH%' THEN 'PAGE LATCH (non-I/O)' WHEN wait_type LIKE 'LATCH%' THEN 'LATCH (non-buffer)' WHEN wait_type LIKE 'LATCH%' THEN 'LATCH (non-buffer)' ELSE wait_type END AS wait_category, wait_type, wait_time_ms, waiting_tasks_count, max_wait_time_ms FROM sys.dm_os_wait_stats WHERE wait_type NOT IN ('LAZYWRITER_SLEEP', 'CLR_AUTO_EVENT' , 'CLR_MANUAL_EVENT', 'REQUEST_FOR_DEADLOCK_SEARCH', 'BACKUPTHREAD', 'CHECKPOINT_QUEUE' , 'EXECSYNC', 'FFT_RECOVERY', 'SNI_CRITICAL_SECTION', 'SOS_PHYS_PAGE_CACHE' , 'CXROWSET_SYNC', 'DAC_INIT', 'DIRTY_PAGE_POLL', 'PWAIT_ALL_COMPONENTS_INITIALIZED' , 'MSQL_XP', 'WAIT_FOR','DBMIRRORING_CMD','DBMIRROR_DBM_EVENT','DBMIRROR_EVENTS_QUEUE','DBMIRROR_WORKER_QUEUE', 'XE_TIMER_EVENT','XE_DISPATCHER_WAIT','WAITFOR_TASKSHUTDOWN','WAIT_FOR_RESULTS' ,'SQLTRACE_INCREMENTAL_FLUSH_SLEEP','WAITFOR' ,'QDS_CLEANUP_STALE_QUERIES_TASK_MAIN_LOOP_SLEEP','QDS_PERSIST_TASK_MAIN_LOOP_SLEEP','HADR_FILESTREAM_IOMGR_IOCOMPLETION','LOGMGR_QUEUE','FSAGENT' ) AND wait_type NOT LIKE 'PREEMPTIVE%' AND wait_type NOT LIKE 'SQLTRACE%' AND wait_type NOT LIKE 'SLEEP%' AND wait_type NOT LIKE 'FT_%' AND wait_type NOT LIKE 'XE%' AND wait_type NOT LIKE 'BROKER%' AND wait_type NOT LIKE 'DISPATCHER%' AND wait_type NOT LIKE 'PWAIT%' AND wait_type NOT LIKE 'SP_SERVER%') select  wait_category, cast(sum(wait_time_ms) as float) as wait_time_ms, cast(sum(waiting_tasks_count) as float) as waiting_tasks_count, cast(max(max_wait_time_ms) as float) as max_wait_time_ms from WaitCategoryStats where wait_time_ms >100 group by wait_category"
        <Result>
            Type "sql_server_wait_stats"
            InstancesFrom "wait_category"
            ValuesFrom "wait_time_ms" "waiting_tasks_count" "max_wait_time_ms"
        </Result>
    </Query>

    <Database "master">
        #Host "<somename>" # optional, the name that will be used by collectd to store the data
        Driver "freetds"
        DriverOption "host" "<sqlservername>"
        DriverOption "username" "sa"
        DriverOption "password" "<secretpass>"
        Query "perfstats"
        Query "waitstats"
    </Database>
</Plugin>

what the heck?

This is the first time I need a separate header for the heck that happened while setting this up.

freetds

freetds is simply ignoring the port option when using -S. At least there's a hint about it in the documentation. And even the strange logging method can be found in the documentation. Still, I was looking for a debug option in man.

SQL Server and data types

It seems like SQL Server replies with binary data when using the sum or max function. At first I was thinking cast(sum(whatever) as int) would be fine. But the data being returned required bigint. This didn't work either... Well, read the docs again.

Functions return bigint only if the parameter expression is a bigint data type. SQL Server does not automatically promote other integer data types (tinyint, smallint, and int) to bigint.

Not sure if I get this correctly, but basically SQL Server won't return a bigint when it in combination with cast. I ended up using float to get a proper return.

Conclusion

Don't ever try to monitor an SQL Server. Never trust Github Repos by Microsoft. They might be broken.