Matt's database tidbits

Mar 19, 2013

pgBadger is a Perl script created as a successor to pgFouine. It reads PostgreSQL logfiles and creates a dashboard of basic metrics and log messages that can be useful in analyzing past server performance. The latency of the information is determined by the frequency you run pgBadger across your logs.

I use a simple setup to parse logs with PdBadger and expose them via lighttpd, with directory browsing to view history. Any web server you have laying about would also do - its simple static content.

A cron job calling the following script every 15 minutes generates the current date's page. Specifying a date as an argument creates a page from a past date (I rotate logs daily):

#!/bin/bash
if [ $# -eq 0 ]; then
    dt=$(date +%Y-%m-%d)
else
    dt=$1
fi
fd=$(date -d "$dt" +%Y%m%d)

pgbadger -T '<a href="/logs">pgBadger</a>' -o /var/www/pglogs/pgsql_${fd}.html /var/lib/pgsql/data/pg_log/postgresql-${dt}_*.log

index="<meta HTTP-EQUIV='REFRESH' content='0; url=logs/pgsql_${fd}.html'>"
echo $index > /var/www/pglogs/index.html

The root of the site displays the current days log information, and the -T parameter creates a link in the page header to the browsable directory of old logs. Lighttpd is in part configured for local directory browsing as:

server.document-root = "/var/www/pglogs"

$HTTP["url"] =~ "^/logs($|/)" {
    dir-listing.activate = "enable"
    dir-listing.hide-dotfiles = "enable"
    dir-listing.exclude = ( "\.sh$", "\.conf$" )
   }

index-file.names = ( "index.html" )

Although just a static view on a set of logfiles, the ease of implementation makes this a great way to keep tabs on a database server. Add tail_n_mail, or tie check_postgres into your Ops monitoring platform and you can get notifications when to look at your pgBadger dashboard.

May 24, 2012

A tweet by @hmason pointed me to a handy utility called pv, or pipe viewer, which monitors data flowing through a pipeline.

For PostgreSQL backups performed interactively (or any database dump utility), a quick progress monitor can be shown based on database size: gist:2783910

# dump
pg_dump testdb | pv -c -s (psql -tc "SELECT pg_database_size('testdb')") -N dump | gzip | pv -N gzip > testdb.sql.gz

# restore gzip
gunzip -c testdb_20120501.sql.gz | pv -s $(gzip -ql testdb_20120501.sql.gz | awk '{print $1}') -N unzip | psql testdb

#restore sql
pv -s $(du -sb testdb_20120501.sql | awk '{print $1}') testdb_20120501.sql | psql testdb

Progress indicators will appear similar to:

dump:  359MB 0:00:14 [28.4MB/s] [>                  ]  0% ETA 2:28:41 
gzip: 47.3MB 0:00:14 [3.78MB/s] [    <=>                            ]
Jan 23, 2012

I have a series of jobs that log to the console, and tee to logfiles. It appears the default date logging is in the format of dd-MM, which results in a non-sortable sequence of events (with western date formats).

After some digging, trial and error, it turns out that a simple 2 line change to the source, and an recompile is all that was needed - actually less effort than altering all of my ETL jobs.

I guess I will call this a win for open source - Don't like it? Fix it.

For PDI 4.2.1:

Index: src/log4j.xml
===================================================================
--- src/log4j.xml   (revision 16273)
+++ src/log4j.xml   (working copy)
@@ -32,7 +32,7 @@
      I imagine it gets cached and reused throughout the life of the application). 

      -->
-         
+         

Index: src-core/org/pentaho/di/core/logging/LogWriter.java
===================================================================
--- src-core/org/pentaho/di/core/logging/LogWriter.java (revision 16273)
+++ src-core/org/pentaho/di/core/logging/LogWriter.java (working copy)
@@ -101,7 +101,7 @@
     // Play it safe, if another console appender exists for org.pentaho, don't add another one...
     //
     if (!consoleAppenderFound) {
-           Layout patternLayout = new PatternLayout("%-5p %d{dd-MM HH:mm:ss,SSS} - %m%n");
+           Layout patternLayout = new PatternLayout("%-5p %d{yyyy-MM-dd HH:mm:ss,SSS} - %m%n");
        ConsoleAppender consoleAppender = new ConsoleAppender(patternLayout);
        consoleAppender.setName(STRING_PENTAHO_DI_CONSOLE_APPENDER);
        pentahoLogger.addAppender(consoleAppender);
Aug 23, 2011

After experimenting with MongoDB as a datastore for log file data, I wanted to look for ways to improve the import rate. The project has also expanded to using MS SQL Server and PostgreSQL as storage targets, and in all cases, inserting data via multiple connections balances against the performance loss seen by not using native bulk insert tools.

It turns out that except for a few issues on Windows (lacking an a fork method in the OS), it is a fairly straightforward to spawn multiple import processes, so that you can maximize CPU usage to the limit of your storage IO: http://code.activestate.com/recipes/577856-multiprocessing-import-wrapper/

Apr 22, 2011

I have read an article or two suggesting MongoDB is useful as a logging target. So, I thought I would create a simple prototype to try this out. My sample logs are in key=value format (event = description). Rather than write code that parses the log lines, then inserting into the Mongo database, I opted to translate the log lines to JSON objects, to allow the mongoimport utility to perform the bulk imports. After a few trial runs, a simple converter looked like the following:

#!/usr/bin/env python

import json, re, sys
infile = sys.argv[1]

if len(sys.argv) > 2 : outfile = sys.argv[2]
else: outfile = infile + '.json'

re_kv = re.compile(r'(\w+)=(\S*)')

out = open(outfile,'w')
for line in open(infile,'r').readlines():

    js = dict(re_kv.findall(line))
    js = json.dumps(js, indent=4)

    out.writelines(js + '\n')    
out.close()

On my 2009 MBP, I was able to convert log files about 3,200 rows / second, and the import utility loaded the files at 10K rows / second (without the journaling option). This leaves me a quick datastore to use map/reduce queries to perform ad-hoc log analysis upon. The Python script above may not be the most efficient possible, but a optimizations on the first version (compiling the regex, and replacing a list comprehension with a string join), has made it reasonably fast. If anyone can suggest faster parsing logic, please chime in.

Update 2011-06-21:

The above code produces a simple json formatted file for the MongoDB bulk import utility. As this project has evolved, the need for nested documents arose, where a sequence of sub-events comprise a complete log entry. Thus importing a flat file is not sufficient. The pattern has evolved to create the parent record, then upsert inner task items into an array:

if type == 'EVENT_START':
    log.save(event, manipulate=False, safe=True)
elif type == 'TASK':
    log.update({'eventID':event['eventID']}, {'$push':{"tasks":log}}, upsert=True, safe=True)
elif type == 'EVENT_END':
    log.update({'eventID':log['eventID']}, {'$set':event}, upsert=True, safe=True)

The advantage here over a traditional RDBMS is that there is no lookup activity between tables when inserting or reading to get a chain of tasks that are part of a logged event. The three statements above quickly and simply create a single document that comprise all activity for an event made up of 1..n task elements.

Jan 26, 2011

I have been a “Relational Database Practitioner” for more than a few years now. Starting with xBase tools (dBase, FoxPro), then DB2/2 (DB2 on OS/2, the good old days), and eventually onto a decade of Microsoft SQL Server. Recently I have been looking into PostgreSQL as a low cost 1, capable option.

When the “NoSQL” movement began, I like many in my field was skeptical. Much of my career was defined by projects that involved taking a “mess of data”, stored in a variety of formats, and coalescing it into something useful. The idea of storing data in a system that did not enforce data integrity through constraints, types, and a fixed schema seemed like a step in the wrong direction. I was also unsure if these new technologies were actually based on new ideas. Looking at document storage engines like CouchDB, I recalled days at shops that ran business applications on Lotus Notes.

It is also worth noting that many shops I have worked in tend to be monotonic. They base all efforts on a single platform or vendor, in an attempt to control complexity and cost. The only development options were those that were part of the small set of tools available with through the purchased subscription. Unfortunately, experience has taught me that when your tools are limited, often so are the quality of your solutions. A open mind should come with an open toolset.

At my current company, I find myself in an environment that not only utilizes a variety of technologies, but also challenges the points above. We manage data in a range of tools, some relational, some non-relational (and some very non-traditional). We also do not limit platforms to a single vendor. When projects are launched, decisions are made based on the viability of a technology to the solve the issue at hand. All vendors are an option; both commercial and open-source tools can be evaluated.

During a personal trip I took the opportunity to attend MongoDC, a single day conference put on by the purveyors of MongoDB. Here I was pleased to find not only a good bit of information about the database, its operation, and the very enthusiastic company behind it, but a large segment of the audience who were like me - relational practitioners looking for ways to make life easier - and perhaps a bit more "fun". A larger than expected segment of the audience were not rabid RDBMS haters, but those looking to solve more problems with more options. In other words, these were relational DBAs with an open mind.

Thus like others I challenge some of the thoughts in the “NoSQL” movement. I do not see non-relational databases as a wholesale replacement for traditional RDBMS platforms. I see them as a complement. There are cases where the performance implications of using joins due to normalization and rigid schemas can present scalability problems, where an easily implemented scale-out approach can solve. But on the other hand, there are cases where validation of data at the database level is of critical importance; transactions that cross tables are required, acknowledgement that data has been written to disk is critical. This is where the safeguards of an RDBMS are a requirement. Thus both approaches are valid, depending on the needs of a particular application, or layer of a system.

Like the static vs dynamic debate, or the Vi vs Emacs wars, the winners are those who choose what works best, each time they are faced with a problem to solve. In the concept of “using the right tool for the job”, I prefer instead to think of the moniker as #NotOnlySQL. Others I work with do too.


  1. “low cost” in this case means the software is free, but there is a time and effort expense in learning to properly administer a new platform. In the long run, I believe the TCO may be an advantage. 

Apr 3, 2010

When writing a utility in C# to automate database upgrades in the field, I ran into a problem with getting status messages generated by T-SQL scripts in PRINT statements. The goal was a tool that would execute scripts in a similar fashion as SSMS or SQLCMD, supporting batch terminators ("GO").

SMO offers the ExecuteNonQuery method, which unlike its ADO.NET counterpart, can run multiple batches in a single script, separated by GO statements. However, capturing return messages was not as as obvious as I hoped.

A bit of searching led me to the following approach (partial snippet):

using Microsoft.SqlServer.Management.Common; 
using Microsoft.SqlServer.Management.Smo;

private static Server _server;

string sql = ""; 
string msg = "";
int rows;

StringBuilder infoMessage = new StringBuilder();

// Capture return messages (PRINT statements) 
SqlInfoMessageEventHandler messagehandler = delegate(object sender, SqlInfoMessageEventArgs e) { infoMessage.Append(e.Message); };

_server.ConnectionContext.InfoMessage += messagehandler; 
_server.ConnectionContext.StatementTimeout = 0;

rows = _server.ConnectionContext.ExecuteNonQuery(sql);

// Get text of PRINT statements in scripts 
msg = infoMessage.ToString();

// Remove handler to prevent a memory leak 
_server.ConnectionContext.InfoMessage -= messagehandler;

This last line is critical, otherwise you will find as I did that this mechanism will consume a significant amount of memory when you run a long series of scripts in a loop, as message handlers are created, but never released and garbage collected.

Mar 25, 2010

Once I figured out how to connect to SQL Server via Python scripts on a Mac, I still needed a way to get Windows OS details about my SQL Servers. There are many things that can be queried through SQL Server itself (ex: xp_fixeddrives), but some OS object are unavailable.

WMI - Windows Management Instrumentation - would be the way to go through PowerShell, but there is no WMI nor PowerShell on a Mac. But there is a CMI client developed by Zenoss that does the trick. In fact, it provides a query interface with SELECT statements. Good DBA-type stuff.

Here is the process I used on Snow Leopard, resulting in a WMI console client:

svn co http://dev.zenoss.org/svn/trunk/wmi zenoss 
cd zenoss
cp GNUmakefile makefile
chown (user) makefile
sudo make
wmic 

wmic should return usage information, and thus will let you know that the client built properly. This example will return all the information you could want about the disks on a Windows box:

./wmic -U user //(domain/)server "select * from Win32_LogicalDisk"
Mar 25, 2010

I am in a new role, where I not only work with MS SQL Server on Windows servers, but now use a MacBook Pro as my primary workstation (win!), and interact with system monitoring applications hosted on Linux boxes. One of the first challenges in this role was to port my old but trusty SQL Server monitoring scripts in Python to a more OS neutral format 1. Bye-bye ADO (they were written 2003-2005) - hello ODBC my old friend.

Turns out, that with a little research, this was fairly easy, and the database connection/query methods in the scripts are now a bit more concise. I chose to use the pyodbc module developed by Michael Kleehammer (it is good stuff) as an ODBC API, along with FreeTDS for the network protocol.

What follows is the process for building the driver and module, and a short test script to verify things are working:

  • (Install Xcode for compiler if needed)

  • Install iODBC: http://www.iodbc.org (may be optional if you already have the iODBC manager installed, or you are using unixODBC)

  • Install FreeTDS from source:

    ./configure --with-tdsver=8.0 --enable-msdblib
    
    make
    
    sudo make install
    
  • Test:

    tsql -H "server(\instance)" -p 1433 -U login -P password
    
  • Install pyodbc from source:

    python setup.py build
    
    python setup.py install
    
  • (optional) Configure ODBC driver in iODBC:

    Description: TDS Driver file: /usr/local/lib/libtdsodbc.so
    Setup file: /usr/local/lib/libtdsodbc.so
    Define as: System
    
  • Python test script 2:

    import pyodbc
    
    cs = (r'DRIVER={TDS};SERVER=server\\instance;DATABASE=database;UID=login;PWD=password;' )
    cnx = pyodbc.connect(cs)
    cur = cnx.cursor()
    
    sql = ('SELECT @@VERSION') 
    
    rows = cur.execute(sql).fetchall() 
    for row in rows:
    print row 
    
    cnx.close()
    

If the above script returns version information from your SQL Server instance, then all is well!


  1. After this all worked for several days, I started my MacBook up one day, and my scripts presented an "image not found" error. Setting the driver to a path instead of a name has temporarily resolved the error. My Unix-fu is still weak, so there may be a better solution to this problem. I started a thread in the pyodbc lists to seek other solutions: http://groups.google.com/group/pyodbc/browse_thread/thread/a8b202424660708f 

  2. It has been pointed out that SNMP is a better way to monitor databases, and as it is a monitoring standard, I might agree. This post is an example of one way to connect to SQL Server from OSX using simple code, not a comprehensive monitoring framework. The Python scripts mentioned perform simple queries and exit - alternate approaches are better suited for live 24x7 monitoring.