Logging with Structure

Logging is such an important part of any production service. It acts as your eyes and ears into a service’s inner workings, helping you debug issues and explain behaviour retrospectively, as well as letting you watch what a component is up to in real time.

The best thing about logging is that it comes very naturally. When you first start writing code, before you even know what a debugger is, the print statement is your lifeline - and as it turns out, it continues to be very important! Just add some simple print statements on STDERR and you’re on your way to living a happy and well-logged life.

The trouble is, logging is a bit like commenting - it needs to be approached with care, lest you end up with a novel of ambiguous messages and a distinct lack of the information you actually need!

What to Log?

The most important part of logging is working out what you actually want to log. All too often I’ve been asked to debug an issue, and while there are plenty of log entries for the given request, there’s a sudden drought at the crucial moment.

It helps to think about log content as two categories:

  1. Messages, explaining what is happening for a human
  2. Key Data, to record important parameters needed for debugging

I think the best policy for what data to log is to err on the side of caution and risk logging too much rather than too little. Once the code in question has faced the real world for long enough, you’ll easily get a feel for what (if anything) is too noisy[1]. In the meantime, you’re much less likely to find yourself without the information you need to debug an issue.

When writing log messages, succinctness is helpful. Your logging should be aimed at fellow developers, who have access to the source code, so they should be able to read the codebase alongside the log output to follow what went on and why. Too much verbosity can make log messages difficult to read, particularly if lines start wrapping. There’s also the risk that a very verbose message drifts out of sync with the actual code, if you try to enshrine too much of the logic in it.

I’ve found it most helpful to include, where applicable, the following pieces of information in your logging:

A Correlation (or Trigger) ID

The first important thing you’ll want to ensure your logs contain is some kind of identifier for what triggered the processing. This is commonly referred to as a “correlation ID”, as it allows you to correlate events across a distributed system that belong to a single chain of events.

You’ll commonly get this from the metadata of whatever it was that triggered the work you’re doing. For example, if your service presents itself as an HTTP API server, then you might extract (and possibly generate) an X-Correlation-Id HTTP header to include in a log message. Otherwise, if your service acts upon messages from a message broker, then you might want to log the message ID.

This allows you to look across logs to find the origin of the request that caused a certain action, or to track down all of the changes made in response to a given request.
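
As a minimal sketch of the HTTP case (assuming a dict-like headers object from whichever framework you’re using):

import uuid

def correlation_id(headers):
    # Reuse the caller's X-Correlation-Id where one was supplied,
    # otherwise generate one so the chain has a starting point
    return headers.get("X-Correlation-Id") or str(uuid.uuid4())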

A Target Resource ID

Next, you’ll want to log what resource you’re operating on. Of course, you’ll often be working with multiple pieces of data, but there is usually one obvious primary target of a request - and if not, just log multiple IDs!

For a REST API, this would be the ID of the resource given in the URI. More generally, it might be the primary key of the row in a relational database that maps to the object in use.

This allows you to search across logs to easily identify all of the actions that have been taken against a given object.

A Message

Of course, logging output should also contain messages that describe what action has been taken, or what decision has been made.

This is the most important part of your logging, as it’s what you and your fellow developers are going to end up reading - so make sure it’s human readable!

Beyond the message itself, there are a few more pointers I’d recommend following:

Key Parameters

You will thank yourself for logging additional important parameters. What is important? Anything that affects the control flow of the program and can’t easily be derived from the other data you’ve logged.

For example, say you’re writing a service that manages your user’s public profile, and you’re adding logging for an endpoint that makes a post on the user’s timeline. You don’t want to log the text content of the post, but perhaps you have a flag that allows setting whether a post will notify the user’s friends. As the logic gets more complex, the circumstances under which notifications are sent might grow - so logging whether these notifications were explicitly requested may help to clear up confusion!
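
Sketched out in code, that example might look like this (all the names here are made up for illustration):

def create_post(log, user, text, notify_friends):
    # Log the flag that steers the notification logic, but never the post body
    log(f"Creating post for user {user.id}, notify_friends={notify_friends}")
    # [...]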

Also, be careful not to log anything sensitive, such as API keys or credentials.

A Timestamp

You almost certainly want to log timestamps, for obvious reasons!

You might write timestamps in your application itself, or they might be added automatically by whatever you’re writing logs to. If you’re logging directly to a file, then you’ll need to add them yourself. If your logs are going to the SystemD journal, then every entry gets a timestamp for free, so you’ll want to avoid duplication!
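
If you do need to write them yourself, a minimal sketch might be:

import sys
from datetime import datetime, timezone

def log(s):
    # ISO 8601 timestamps in UTC sort and compare cleanly
    print(f"{datetime.now(timezone.utc).isoformat()} {s}", file=sys.stderr)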

Clear Delineation

Finally, you want your logging to clearly delineate the boundary between execution contexts. Often this ties in with the correlation ID, as it naturally lends itself to the task.

You want someone to be able to quickly search the logs for the relevant section, and having each request clearly bounded in the logs makes it much easier to identify where the relevant lines begin and end.

How to Log?

If you follow the advice above, you might end up with logging output a little bit like this (for some dummy service I’ve dreamt up):

2020-09-10 07:12:37 Received job 876459
2020-09-10 07:12:37 Updating stock count for product KL98277
2020-09-10 07:12:38 Creating restock notification, forced by product config
2020-09-10 07:12:38 Job 876459 completed

It’s only a toy example, but hopefully it illustrates what we’re aiming for. The job ID is logged as a way of tracking where the request came from, so we could search across the logs from other services to find where this job ID was created. The product ID we’re operating on is extracted from the message body and displayed clearly, in case you want to see everything that has happened to that product. We then also provide an important detail about why we triggered a certain piece of logic - in this example, we assume that certain products can be configured in the database to always send restock notifications (perhaps they are high value items with a long lead time). We don’t strictly need to log this information, but it might help us retrospectively diagnose why we sent notifications, particularly if a lot of staff are able to edit that flag!

This does reveal a potential issue though, in that important information can only be found by scrolling through logs manually to find relevant lines. This problem gets worse as code gets more complicated - and this is before we start discussing asynchronous or concurrent code!

Imagine a situation where suddenly you’ve had a complaint that the system has sent a large number of restock notifications. The first thing you might do is search your logs for matches of “Creating restock notification” - this will probably find you lots of messages that give you an easy feel for the general time of the incident, but it doesn’t really help you diagnose the cause! You’re likely more interested in finding out which jobs or products caused the notifications.

You can write a simple script to pull out that information, or you can make use of grep -B if the pattern is predictable enough. For the toy logs above, grabbing the line before each match recovers the product ID (the log file name here is made up):
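
grep -B 1 "Creating restock notification" service.log

Or perhaps you can make your logging even more helpful…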

Tagged Logging

Tagged logging is a method of logging whereby every message is prefixed with a set of tags that give the most critical information, wherever it is available. It provides a really simple solution to the issue of not being able to quickly identify what a log line is referring to.

With tagged logging, the small example might be transformed into something like:

2020-09-10 07:12:37 [876459] Starting job
2020-09-10 07:12:37 [876459][KL98277] Updating stock count
2020-09-10 07:12:38 [876459][KL98277] Creating restock notification, forced by product config
2020-09-10 07:12:38 [876459] Job completed

This is really easy to do, and it suddenly makes your logs so much faster to read. A simple grep for the message we want instantly shows us every product ID it applies to!

Implementing tagged logging is extremely easy. In the simplest case, just remember to interpolate those tags at the start of every log message:

# We use log(s) to abstract away the actual details of output
log(f"[{job.id}][{product.id}] Updating stock count")

This is cumbersome, however, and risks breaking abstraction if you’re forced to pass metadata around your function calls (you don’t want to make every function in your code that wants to do logging have to take a job as an argument, do you?). Instead, you might consider making use of function composition (or the decorator pattern if working with a logging object):

def update_stock_count(log, product, delta):
    # This function doesn't need to care about applying tags, they're already there!
    log("Updating stock count")
    # [...]

# Elsewhere:

# log(s) could be the basic print wrapper
# or log(s) could be a previously wrapped tagging function
taglog = lambda s: log(f"[{product.id}] {s}")
update_stock_count(taglog, product, delta)

When using a logging library, you’ll see this pattern usually referred to as child loggers, or contextual loggers.
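
Python’s standard library has this built in, for example, in the shape of logging.LoggerAdapter:

import logging

logging.basicConfig(format="%(asctime)s [%(job_id)s] %(message)s", level=logging.INFO)
base = logging.getLogger("worker")

# The adapter stamps every record with the job ID, so callees can log
# through it without ever knowing the tag exists
job_log = logging.LoggerAdapter(base, {"job_id": "876459"})
job_log.info("Starting job")  # -> 2020-09-10 07:12:37,000 [876459] Starting job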

In cases like these, it’s important to pass your logger around the application, rather than using a global logger[2]. You’ll find that adding tags to logs usually fits very well semantically with function calls, so I find it quite natural to pass the logger around manually[3]. Of course, you might find that cumbersome, so you could consider setting it as a property of certain objects - but you then need to be very careful of the scope of the object, as you don’t want your tags to be persisted outside of where they are relevant!

Structured Logging

Structured logging takes tagged logging one step further. A structured logger expects all information to be provided as key-value pairs, often even forcing you to assign a key to the human readable message you want to log (although some libraries provide sugar to make this most common requirement easy). This approach makes your logging output far more machine readable, and much easier to expand to support more complex logging backends than simply printing to STDERR or appending to a file.

Implementing structured logging is a bit more involved, and probably no longer worth rolling your own, particularly if you want the benefit of being able to switch between logging backends. Thankfully, logging libraries are all over the place, and are usually quite flexible.

What I’m trying to say is, whatever environment you’re working in, there will be a logging library that supports structured logging[4].

Using structured logging seems confusing at first, but it’s largely just a case of forcing yourself to be consistent with log contents. It’s nigh-on impossible to log a value without making it clear what that value represents, thanks to being forced to give each value a corresponding key.
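
In Python, to pick one example, the structlog library works exactly this way - the human readable message is just another field:

import structlog

log = structlog.get_logger()
# Every value needs a key; the message itself is the "event"
log.info("ping complete", remote_host="cucurbit.dev", packets_sent=3, packets_dropped=0)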

As a real-world example, I recently wrote a small Go service that attempts to keep my internet connection alive. Its logging, using Go kit’s log package, looks something like this:

level.Debug(logger).Log(
    "remote_host", dest,
    "packets_sent", stats.PacketsSent,
    "packets_dropped", stats.PacketsSent - stats.PacketsRecv,
    "latency", stats.AvgRtt,
    "msg", "ping complete",
)

By default, this gets formatted to the console like so (with an extra line thrown in for comparison):

level=debug ts=2020-09-10T22:32:58.516539Z remote_host=cucurbit.dev packets_sent=3 packets_dropped=0 latency=40.675733ms msg="ping complete"
level=debug ts=2020-09-10T22:32:58.516539Z msg="connectivity ok"

This simple default formatting is known as logfmt[5], but it should hopefully be apparent that a structured logging interface lends itself to all manner of output formats. For example, you could log as JSON or YAML if you want to parse these logs later. Alternatively, you could send these logs to a dedicated logging system, such as Logstash in the ELK Stack, or SystemD’s journal, which will give you tools to easily search logs by filtering on specific key-value pairs.

Any good logging library will allow you to format and write logs wherever you like, either by using an off-the-shelf adapter/formatter, or by providing an easily extensible API for you to implement your own. You’ll thank yourself for that flexibility later on, when you don’t need to replace all of your logging code to satisfy the whims of the devops team!

Wrap-Up

Logging is easy, but also easy to get wrong. It’s incredibly important for any long-running service to have good logs, as they’re your window into a running system where you’re not going to be able to catch an issue with a breakpoint in a debugger.

In this post I’ve only scratched the surface of logging, but hopefully I’ve shown some basics on how to make flexible and developer-friendly logging decisions that you’ll thank yourself for later. One thing to look into further is levelled logging, which gives you a way to easily configure how much to log without needing code changes[6].
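
As a quick sketch of that idea, using Python’s standard logging module and a hypothetical LOG_LEVEL environment variable:

import logging
import os

# Verbosity comes from the environment, so production and staging can
# differ without any code changes
logging.basicConfig(level=getattr(logging, os.environ.get("LOG_LEVEL", "INFO")))

logging.getLogger(__name__).debug("only emitted when LOG_LEVEL=DEBUG")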

Don’t forget to think about metrics too when implementing a service! Logging is there to let you drill down into the details of what the system did; metrics will augment that by giving you a higher-level, aggregated view of what a system does and how often it does those things.

Maybe I’ll follow up with some further thoughts on logging backends and metrics one day!


  1. And in a lot of cases, small tweaks to log retention and a bit of compression can save you all the space you need!

  2. Definitely do not try to use tagged logging with a global logger! I’ve had to work with code that tries to work like this, creating a global logging object with push and pop functions for tags. It is cleaner when you get started, but quickly falls apart in the face of concurrency and asynchronicity - you’ll soon find yourself swimming in incorrectly nested tags spread all across your logs! The language runtime/compiler is built to handle the kind of scoping that this push/pop approach is trying to recreate, so make use of that power!

  3. I like to treat the logger somewhat like a context.Context in Golang, which ensures it’s always relevant - but that does have the drawback of muddying your interfaces. I think there’s probably scope to make use of a Context’s built-in key-value store to help, but that’s an idea for another time!

  4. I like winston, although I will admit the documentation here can be a little lacking.

  5. Definitely recommend reading Brandur’s logging articles.

  6. For example, you might want to set the verbosity based on an environment variable. In production, perhaps you only want to log informational messages and have a long retention period. In staging, you’ll almost certainly want the highest possible verbosity, but the longevity of these logs isn’t as important, so you can happily trade more detail for less history.