Parsible: Straightforward Log Parsing

The Graduate

Scratching the Itch

It’s something that everyone needs to do, a coming of age experience for a tech company. Yes, that’s right, I am talking about log parsing. And, much like other coming of age stories, everyone does it a little bit differently. There are so many decisions to make:

Is this just a one time thing or a long term commitment?

What happens if we change our mind about what we want?

Who’s going to be dealing with this mess?

But enough with the euphemisms.

What we wanted in our Log Parser … and Why:

  • Written in Python: We are a Python shop, and things are easier when our technical knowledge is readily applicable to our tools. For us, Python is the natural choice because there are more than 10 serious Pythonistas in the office. If you work at a Ruby shop, favor Ruby where it makes sense.

  • Batch Mode and Continuous: We want to be able to use our parser to continually feed our graphs as new data comes in - we don’t like waiting. This doesn’t mean we don’t sit down with our old logs from time to time and peek under the covers. We want to be able to use the same tool for static analysis as we do for real time tasks. We reduce overhead if we can use one tool for both.

  • Simple to Use: One of our core beliefs about metrics and graphs is that the people who care about them need to be the ones watching them. Sysadmins staring at Clicks/Minute can be helpful, but it’s much better when Analytics has them on their screen.

By extension, that means we want Analytics defining what gets charted and recorded. If you simplify the interactions with the data, you lower the bar for them to set it up themselves. A simple plugin-based architecture solves this by abstracting away the complexity so end users can focus on getting value from the data.

  • Low Impact: If we want to parse the data as it comes in, it means we will be on production machines. This means no hogging resources from the user facing components.

  • Free: Sorry Splunk, you’re awesome but a bit pricey for Startups.

  • Can be described by lots of Buzzwords: Self-Explanatory

Yet Another Log Parser?

Alice: I wrote a log parser! Isn’t it awesome?!

Bob: But there are already tools that do this!

Alice: Really?

Bob: Well, mostly…

There are more tools to do these sorts of things than you can shake a stick at, and some of them even fit the requirements! The better ones (for us) that we have seen:


  • Ganglia Logtailer: A framework focused on ripping your data out and pushing it to Ganglia. It supports both batch and daemon modes, and it is written in Python! However, a look at the source doesn’t inspire confidence that non-developers will have an easy time of it.

  • Logster: A nice fork of Ganglia Logtailer that is tailored toward cron-style execution with some additional outputs. More focused on simplicity, it echoes many of Etsy’s other high quality releases. Unfortunately it is cron only and hasn’t made enough headway towards simplicity.


  • Splunk: An awesome program for doing anything with logs of any sort. It can monitor, alert, search, graph and correlate. Super easy to use too! However, the free version is mostly a toy and the enterprise version is not what most would call cheap:

Splunk Enterprise pricing in North America starts at U.S. $6,000 for a 500 megabyte-per-day perpetual license, including first year support, or U.S. $2,000 per year for a term license including support.

A pretty sobering price since most companies will be generating and searching many gigabytes of logs per day. The cost itself will make people hesitate to use it on new data, limiting its effectiveness. Too rich for our blood.


We decided to roll our own. After formulating the requirements and spending far too much time with grep, we built Parsible. The code is open source (obviously), and a fresh fork is all it takes to start parsing your own logs.

How does it work?

When you want to dig into a log with Parsible, you need three things:

1. A Parser: This will be fed single lines from the log file and will return data in a format of your choosing. We like to build dictionaries by using regex.

Parsing functions need to live in the parsers directory and their names should start with parse_; Parsible will find them from there (a sketch of how this kind of discovery can work follows after this list).

2. A Processor: This is what will act on your parsed data. This is where most of the useful filtering is going to be done and where the data will be sent on its way. It can be as simple as checking the status code of an HTTP request or getting the time to serve a request to the user.

Much like the parsers, processors live in the processors directory and methods need to start with process_. Parsible will automatically discover, import, and run these for each line of your file, feeding them with the data returned from your Parser.

3. An Output: Now that you have your data you need to get it somewhere useful. Your outputs should be reusable so that they can be called from many different processors.

There are no strict naming or placement conventions, but it is simplest if you start them with output_ and put them in the outputs directory.
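
To make these conventions concrete, here is a minimal sketch of how prefix-based plugin discovery can work. This is an illustration under assumptions, not Parsible’s actual implementation: it imports every module in a package directory and collects the functions whose names start with the expected prefix.

# A minimal, illustrative sketch of prefix-based plugin discovery.
# Not Parsible's actual code; it assumes the plugin packages
# (e.g. plugins.parsers) are importable from the working directory.
import os
import importlib

def discover_plugins(package, prefix):
    plugins = []
    directory = package.replace(".", os.sep)
    for filename in os.listdir(directory):
        if not filename.endswith(".py") or filename.startswith("_"):
            continue
        module = importlib.import_module("{0}.{1}".format(package, filename[:-3]))
        for name in dir(module):
            candidate = getattr(module, name)
            if name.startswith(prefix) and callable(candidate):
                plugins.append(candidate)
    return plugins

# Hypothetical usage: run every discovered processor on each parsed line.
# parsers = discover_plugins("plugins.parsers", "parse_")
# processors = discover_plugins("plugins.processors", "process_")
# for raw_line in open("/var/log/nginx/access.log"):
#     data = parsers[0](raw_line)
#     for process in processors:
#         process(data)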

Why is this better?

Setting up the framework in this way makes it easy for one person to compile a useful set of parsers, outputs, and helper functions that can be used by anyone who wants to write a custom processor. This puts a layer of indirection between writing the parsers (a developer-centric task), the processors (developers, analytics, or business units), and the outputs (more developer work).

Our workflow gets much simpler with this setup, and extracting new data becomes straightforward:

User:

  1. Clone our internal repo and pick the branch for your desired logs
  2. Write a new processor
  3. Hook up to a ready-made output
  4. Push to the repo

Behind the Scenes:

  1. Chef updates the Git repo on the machine
  2. Parsible is restarted (Runs under supervisord)
  3. StatsD automatically adapts to the new data
  4. Graphite + Graphiti start producing awesome graphs for immediate consumption.

Ready for some examples?

Parsers

A parser for Nginx timed_combined logs, modified slightly from what we use in production (we have some custom fields).

import re

def parse_nginx(line):
    regex = re.compile(r"(?P<ip_address>\S*)\s-\s(?P<requesting_user>\S*)\s\[(?P<timestamp>.*?)\]\s\s\"(?P<method>\S*)\s*(?P<request>\S*)\s*(HTTP\/)*(?P<http_version>.*?)\"\s(?P<response_code>\d{3})\s(?P<size>\S*)\s\"(?P<referrer>[^\"]*)\"\s\"(?P<client>[^\"]*)\"\s(?P<service_time>\S*)\s(?P<application_time>\S*)\s(?P<pipe>\S*)")
    r = regex.search(line)
    result_set = {}
    if r:
        for k, v in r.groupdict().items():
            # Normalize our data: skip empty or placeholder fields
            if v is None or v == "-":
                continue
            # Make the dictionary fields a bit more useful:
            # split the request into a path and a query string
            if k == "request":
                path, _, query = v.partition("?")
                result_set["path"] = path
                if query:
                    result_set["query"] = query
                continue
            result_set[k] = v
    # This becomes what is passed to other functions as 'line'
    return result_set
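
For a quick sanity check of a parser like this, you can feed it lines from an existing access log and print whatever it extracts. This snippet is not part of Parsible itself, and the log path is only an example.

# Ad-hoc check of the parser against an existing log file
# (the path is just an example)
with open("/var/log/nginx/access.log") as log_file:
    for raw_line in log_file:
        parsed = parse_nginx(raw_line)
        if parsed:
            print(parsed)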

Processors

Here we use the output of the parser to count API hits in a very contrived example. In reality we have 10-20 processors running against each line, with a bevy of helper functions to do things like pick out user agents and bots or filter down referrers.

from plugins.outputs.statsd import output_statsd_count

def process_api(line):
    # 'line' is the dictionary returned by the parser
    if 'path' in line and line['path'].startswith('/api/'):
        output_statsd_count('api')
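
As a slightly richer, hypothetical illustration of the helper-driven processors mentioned above, something like the following could count bot traffic by inspecting the client field captured by the Nginx parser. The is_bot helper and the bot_requests stat name are assumptions made for this example, not code from the Parsible repo.

from plugins.outputs.statsd import output_statsd_count

# Hypothetical helper: a crude user-agent check, shown only to
# illustrate how filtering logic can be shared across processors.
def is_bot(user_agent):
    markers = ("bot", "spider", "crawler")
    return any(marker in user_agent.lower() for marker in markers)

def process_bots(line):
    # 'client' holds the user agent captured by the Nginx parser
    if 'client' in line and is_bot(line['client']):
        output_statsd_count('bot_requests')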

Outputs

Check out the simple StatsD output that is called by the processor above. Any output can easily be imported into a processor in the standard fashion. When running with --batch-mode True for some ad-hoc analysis, you could have an output that is as simple as printing a line (a tiny sketch of one appears after the StatsD example below).

# A sample output function that pushes data to StatsD
from socket import socket, AF_INET, SOCK_DGRAM

hostname = "127.0.0.1"
port = 8125
udp = socket(AF_INET, SOCK_DGRAM)

# Good to stay non-blocking so requests don't pile up
def _send_statsd(data):
    udp.sendto(data, (hostname, port))

def output_statsd_count(stat, count=None):
    if count is None:
        count = 1
    # StatsD counter format: "<bucket>:<value>|c"
    data = "{0}:{1}|c".format(stat, count)
    _send_statsd(data)

Not too impressive, since StatsD already makes it really easy to send data over. It gets more interesting as an illustration of how simple it is to write new processors. Right now we send a lot of data over to StatsD, but we also have hookups for Redis and our internal alerting software.
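
For the batch-mode case mentioned earlier, an output really can be that trivial. A minimal sketch (the output_print name is just an illustration, not something shipped with Parsible):

# A minimal print-style output for ad-hoc batch runs;
# the name output_print is illustrative only.
def output_print(data):
    # Dump whatever the processor hands us to stdout
    print(data)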

Obviously these are trivial examples, but head on over to the Parsible page on GitHub for more practical examples, including the Nginx parser, StatsD outputs, and a few custom processors that filter out requests by User Agent and OS.

Happy Parsing!

Andrew Gross is a Developer at Yipit

TL;DR We made a pluggable parsing engine called Parsible
