Causality – A Mechanism for Relating Distributed Diagnostic Contexts

Definition

Causality: The relationship between cause and effect [OED].

Introduction

Supporting a distributed system can be hard. When the system has many moving parts it is often difficult to piece together exactly what happened, at what time, and within which component or service. Despite advances in programming languages and runtime libraries, the humble text format log file is still a mainstream technique for recording significant events within a system whilst it operates. However, logging libraries generally only concern themselves with ensuring that you can piece together the events from a single process; the moment you start to invoke remote services and pass messages around, the context is often lost, or has to be manually reconstructed. If the recipient of a remote call is a busy multi-threaded service then you also have to pick the context of interest out of all the other contexts before you can even start to analyse the remote flow.

 

This article will show one mechanism for overcoming this problem by borrowing a hidden feature of DCOM and then exposing it using an old design pattern from Neil Harrison.

Manually Stitching Events Together

The humble text log file is still the preferred format for capturing diagnostic information. Although some attempts have been made to use richer encodings such as XML, a simple one-line-per-event, fixed-width-fields format is still favoured by many [Nygard].

 

For a single-process/single-threaded application you can get away with just a timestamp, perhaps a severity level and the message content, e.g.

 

2013-01-01 17:23:46 INF Starting something rather important

 

Once the number of processes starts to rack up, along with the number of threads, you need to start including a Process ID (PID) and Thread ID (TID) too, either in the log file name, within the log entries themselves, or maybe even in both, e.g.

 

2013-01-01 17:23:46 1234 002 INF Starting something rather important

 

Even if you are mixing single-threaded engine processes with multi-threaded services it is still desirable to maintain a consistent log file format to make searching and parsing easier. For the sake of this article though, which is bound by the constraints of print-based publishing, I’m going to drop some of the fields to make the log output shorter. The date, PID and severity are all tangential to most of what I’m going to show and so will be dropped, leaving just the time, TID and message, e.g.

 

17:23:46 002 Starting something rather important

 

Assuming you can locate the correct log file to start with, you then need to be able to home in on the temporal set of events that you’re interested in. One common technique for dealing with this has been to manually annotate log lines with the salient attributes of the task inputs, e.g.

 

17:23:45 002 Handling request from 192.168.1.1 for oldwoodc

17:23:46 002 Doing other stuff now

17:23:47 002 Now doing even more stuff

. . .

17:23:59 002 Finished handling request from 192.168.1.1

 

If your process is single-threaded you can probably get away with putting the key context details on just the first and last lines, and then assume that everything in between belongs to the same context. Alternatively you can try to “remember” to include the salient attributes in every log message you write.

 

17:23:45 002 Handling request from 192.168.1.1

17:23:46 002 Doing other stuff now (192.168.1.1)

17:23:47 002 [192.168.1.1] Now doing even more stuff

. . .

17:23:59 002 Finished handling from 192.168.1.1

 

Either way there is too much manual jiggery-pokery going on and, as you can see from the last example, you have to rely on all developers using a consistent style if you want a fighting chance of filtering the context successfully later.

Diagnostic Contexts

The first problem we want to solve is how to “tag” the current context (i.e. a single thread/call stack in the first instance) so that whenever we go to render a log message we can automatically annotate the message with the key details (so we can then grep for them later). More importantly, we’d like to do this in such a way that any code that is not already aware of our higher-level business goals remains blissfully unaware of them too.

 

In Pattern Languages of Program Design Vol. 3, Neil Harrison presents a number of logging related design patterns [Harrison], one of which is called Diagnostic Context. In it he describes a technique for associating arbitrary data with what he calls a “transaction”. The term transaction is often heavily associated with databases these days, but the transactions we are concerned with here are on a much larger scale, e.g. a single user’s “session” on a web site.

 

A distributed system would therefore have many diagnostic contexts which are related somehow. The connection between these could be viewed as a parent/child relationship (or perhaps global/local). There is no reason why a context couldn’t store different “categories” of tags (such as problem domain and technical domain), in which case the term namespace might be more appropriate. However this article is not so much concerned with the various scopes or namespaces that you might create to partition your contexts but more about how you go about relating them. As you will see later it is a specific subset of the tags that interests us most.

 

Although you could conceivably maintain one context per task that acquires more and more tags as you traverse each service layer, you would in effect be creating a Big Ball of Mud. What’s more, the more tags you create the more you’ll have to marshal, and ultimately the more you’ll have to write to your log file and then read again when searching. Although the I/O costs should be borne in mind, the readability of your logs is paramount if you’re to use them effectively when the time comes. And so multiple smaller contexts are preferred, with thread and service boundaries providing natural limits.

Implementing a Simple Diagnostic Context

The implementation for a Diagnostic Context can be as simple as a map (in C++) or a Dictionary (in C#) which stores a set of string key/value pairs (a tag) that relates to the current operation. The container will almost certainly utilise thread-local storage to allow multiple contexts to exist simultaneously for the multiple threads within the same process.

 

It should be noted that some 3rd party logging frameworks already have support for diagnostic contexts built-in. However, they may not be usable in the way this article suggests and so you may still need an implementation like the simple one shown below.

 

At the entry point to our “transaction” processing code we can push the relevant tags into the container for use later. By leveraging the RAII idiom in C++ or the Dispose pattern in C# we can make the attaching and detaching of tags automatic, even in the face of exceptions. For example in C# we could write:-

 

public void ProcessRequest(Request request)
{
    using(new DiagnosticContextTag("ID", request.Id))
    using(new DiagnosticContextTag("HOST", request.Host))
    {
        // Do some funky stuff with the request.
        . . .
    }
}

 

Behind the scenes the constructor adds the tag to the underlying container and the destructor/Dispose method removes it again. The need for the code to be exception-safe is important as we don’t want the tags of one context to “leak” by accident and infect the parent context, because that would only cause unnecessary pollution later when we are searching.

 

As Harrison’s original pattern suggests, we can create contexts-within-contexts by using stack-like push/pop operations instead of just a simple add/remove. However, you still want to be careful not to overload the meaning of any tag (e.g. “ID”) that will be used across related scopes as, once again, it will only create confusion later.
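
 

For illustration only, here is a minimal sketch of what such a stack-based variant might look like (the NestedContext name is mine, and it borrows the same thread-local storage approach as the Context façade shown later); each key maps to a stack of values so that a nested scope can temporarily override a tag and the enclosing value is restored when the scope exits:-

 

public static class NestedContext
{
    // Hypothetical push/pop counterparts to the Attach/Detach methods
    // shown below.
    internal static void Push(string key, string value)
    {
        Stack<string> values;

        if (!s_tags.Value.TryGetValue(key, out values))
            s_tags.Value.Add(key, values = new Stack<string>());

        values.Push(value);
    }

    internal static void Pop(string key)
    {
        var values = s_tags.Value[key];

        values.Pop();

        if (values.Count == 0)
            s_tags.Value.Remove(key);
    }

    // One set of stacks per thread.
    private static readonly ThreadLocal<IDictionary<string, Stack<string>>> s_tags =
        new ThreadLocal<IDictionary<string, Stack<string>>>(
            () => new Dictionary<string, Stack<string>>());
}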

 

When the time finally comes to render a log message we can extract the set of tags that relate to this thread context, format them up nicely and append them to the caller’s message behind the scenes, e.g.

 

public void WriteLogMessage(string message)
{
    string timestamp = FormatTimestamp();
    string threadId = FormatThreadId();
    string context = Context.Format(); // "ID=1234"

    Console.WriteLine("{0} {1} {2} [{3}]",
                      timestamp, threadId, message, context);
}

 

The example above would generate a log line like this:-

 

17:23:46 002 Doing other stuff now [ID=1234]

 

The call to Context.Format() hopefully shows that I’ve chosen here to implement the diagnostic context as a static Façade. This is the same façade that the constructor and Dispose method of DiagnosticContextTag would have used earlier to attach and detach the attributes. In C# the diagnostic context could be implemented like this:-

 

public static class Context
{
    internal static void Attach(string key, string value)
    {
        s_tags.Value.Add(key, value);
    }

    internal static void Detach(string key)
    {
        s_tags.Value.Remove(key);
    }

    public static string Format()
    {
        var builder = new StringBuilder();

        foreach(var tag in s_tags.Value)
        {
            if (builder.Length != 0)
                builder.Append(";");

            builder.AppendFormat("{0}={1}", tag.Key, tag.Value);
        }

        return builder.ToString();
    }

    // One tag container per thread, created lazily the first time
    // a thread touches the context.
    private static readonly ThreadLocal<IDictionary<string, string>> s_tags =
        new ThreadLocal<IDictionary<string, string>>(
            () => new Dictionary<string, string>());
}

 

The Attach/Detach methods here have been marked internal to show that tags should only be manipulated via the public DiagnosticContextTag helper class.

 

public class DiagnosticContextTag : IDisposable
{
    public DiagnosticContextTag(string key, string value)
    {
        Context.Attach(key, value);

        m_key = key;
    }

    public void Dispose()
    {
        Context.Detach(m_key);
    }

    private readonly string m_key;
}

Distributed COM/COM+

The second aspect of this mechanism comes from DCOM/COM+. Each call-chain in DCOM is assigned a unique ID (a GUID in this case) called the Causality ID [Ewald]. This plays the role of the Logical Thread ID as the function calls move across threads, outside the process to other local processes and possibly even across the wire to remote hosts (i.e. RPC). In DCOM this unique ID is required to stop a call from deadlocking with itself when the logical call-chain suddenly becomes re-entrant. For example, component A might call component B (across the wire), which locally calls C, which then calls all the way back across the wire into A again. From A’s perspective it might seem like a new function call, but via the Causality ID it can determine that it’s actually just an extension of the original one.
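
 

Purely by way of illustration (this is not how DCOM itself is implemented), the following sketch shows the kind of decision the Causality ID enables: if an incoming call carries an ID that the component is already servicing, it can be recognised as re-entrant rather than being treated as a brand new caller and blocked behind the original call:-

 

// Illustrative only - the names and the dispatching strategy are invented;
// the point is simply that the causality ID identifies the logical call-chain.
public class CallDispatcher
{
    public void Dispatch(Guid causalityId, Action call)
    {
        if (m_inProgress.Contains(causalityId))
        {
            // Part of a call-chain we are already servicing, i.e. re-entrant,
            // so service it directly rather than queuing it (and deadlocking).
            call();
            return;
        }

        m_inProgress.Add(causalityId);

        try
        {
            call();
        }
        finally
        {
            m_inProgress.Remove(causalityId);
        }
    }

    private readonly ISet<Guid> m_inProgress = new HashSet<Guid>();
}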

 

This Causality ID is allocated by the COM infrastructure and passed around transparently – the programmer is largely unaware of it.

The Primary Causality

The Causality mechanism is therefore nothing more than a combination of these two ideas. It is about capturing the primary tags used to describe a task, action, operation, etc. and allowing them to be passed around, both within the same process and across the wire to remote services in such a way that it is mostly transparent to the business logic.

 

As discussed earlier, the reason for distilling the entire context down into one or more simple values is that it reduces the noise as the processing of the task starts to acquire more and more ancillary tags as you move from service to service. The local diagnostic context will be useful in answering questions within a specific component, but the primary causality will allow you to relate the various distributed contexts to one another and navigate between them.

 

A GUID may be an obvious choice for a unique causality ID (it is what DCOM uses), and failing any alternatives it might just have to do, but GUIDs are not very pleasing to the eye when browsing log data. If the request is tracked within a database via an Identity column then that could provide a succinct integral value, but it’s still not easy to eyeball.

 

A better choice might be to use some textual data from the request itself, perhaps in combination with an ID, such as the name of the customer/user invoking it. The primary causality could be a single compound tag with a separator, e.g. 1234/Oldwood/192.168.1.1, or it could be stored as separate tags, e.g. ID=1234, LOGIN=Oldwood, IP=192.168.1.1. Ultimately it’s down to grep-ability, but the human visual system is good at spotting patterns too, and if it’s possible to utilise that as well it’s a bonus.
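
 

In code the difference is simply how you choose to attach the data, e.g. using the Causality.Attach() helper introduced below (a sketch only; the request properties mirror those used elsewhere in this article):-

 

// One compound primary tag...
using (Causality.Attach("Causality",
         String.Format("{0}/{1}/{2}", request.Id, request.Login, request.Host)))
{
    . . .
}

// ...or the same data as separate tags.
using (Causality.Attach("ID", request.Id))
using (Causality.Attach("LOGIN", request.Login))
using (Causality.Attach("IP", request.Host))
{
    . . .
}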

 

Putting all this together so far, along with a static helper, Causality.Attach(), to try and reduce the client-side noise, we could write the following single-threaded, multi-step workflow (a top-level request that invokes various sub-tasks):-

 

public void ProcessRequest(Request request)
{
    using(Causality.Attach("RequestId", request.Id))
    {
        Log.WriteLogMessage("Starting request");

        foreach(var task in request.Tasks)
        {
            ProcessTask(task);
        }

        Log.WriteLogMessage("Request completed");
    }
}

 

public void ProcessTask(Task task)
{
    using(Causality.Attach("TaskId", task.Id))
    {
        Log.WriteLogMessage("Starting task");
        . . .
        Log.WriteLogMessage("Task completed");
    }
}

 

This could generate the following output:-

 

17:50:01 001 Starting request [RequestId=1234]

17:50:02 001 Starting task [RequestId=1234;TaskId=1]

17:50:02 001 Doing user stuff [RequestId=1234;TaskId=1;User=Chris]

. . .

17:50:03 001 Task completed [RequestId=1234;TaskId=1]

17:50:02 001 Starting task [RequestId=1234;TaskId=2]

17:50:02 001 Doing payment stuff [RequestId=1234;TaskId=2;Type=Card]

. . .

17:50:03 001 Task completed [RequestId=1234;TaskId=2]

. . .

17:50:04 001 Request completed [RequestId=1234]

 

The decision on whether to extend the primary causality with the TaskId or just let it remain part of the local context will depend on how easy it is for you to piece together your workflow as it crosses the service boundaries.
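
 

The Causality façade itself has been taken for granted so far. For reference, the following is one minimal sketch of how it might sit on top of the Context façade shown earlier; the Tag type, the IDisposable scope returned from SetPrimaryCausality and the decision to keep the primary tags in their own thread-local list are all assumptions made for this article rather than a definitive implementation:-

 

public class Tag : IDisposable
{
    public Tag(string key, string value)
    {
        Key = key;
        Value = value;
    }

    public string Key { get; private set; }
    public string Value { get; private set; }

    // Detaches the tag from the current thread's context again.
    public void Dispose()
    {
        Causality.Detach(this);
    }
}

public static class Causality
{
    // Attach a primary tag to the current thread's context; disposing the
    // returned Tag removes it again.
    public static Tag Attach(string key, string value)
    {
        var tag = new Tag(key, value);

        Context.Attach(key, value);
        s_primary.Value.Add(tag);

        return tag;
    }

    internal static void Detach(Tag tag)
    {
        s_primary.Value.Remove(tag);
        Context.Detach(tag.Key);
    }

    // Take a copy of the current thread's primary tags for marshalling.
    public static List<Tag> GetPrimaryCausality()
    {
        return new List<Tag>(s_primary.Value);
    }

    // Inject a marshalled set of primary tags into the current thread; the
    // returned scope detaches them all again when disposed.
    public static IDisposable SetPrimaryCausality(List<Tag> causality)
    {
        var scope = new Scope();

        foreach (var tag in causality)
            scope.Add(Attach(tag.Key, tag.Value));

        return scope;
    }

    private sealed class Scope : IDisposable
    {
        public void Add(Tag tag)
        {
            m_tags.Add(tag);
        }

        public void Dispose()
        {
            foreach (var tag in m_tags)
                tag.Dispose();
        }

        private readonly List<Tag> m_tags = new List<Tag>();
    }

    // The primary tags for the current thread.
    private static readonly ThreadLocal<List<Tag>> s_primary =
        new ThreadLocal<List<Tag>>(() => new List<Tag>());
}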

Marshalling the Primary Causality across the Wire

We’ve removed much of the tedium associated with logging the context for a single-threaded operation, but that leaves the obvious question - how do you pass that across the wire to another service? We don’t usually have the luxury of owning the infrastructure used to implement our choice of transports but there is nothing to stop us differentiating between the logical interface used to make a request and the wire-level interface used to implement it. The wire-level interface may well already be different if we know of a more efficient way to transport the data (e.g. compression) when serializing. If we do separate these two concerns we can place our plumbing right on the boundary inside the proxy where the client can remain unaware of it, just as they are probably already unaware there is an RPC call in the first place.

 

The following logical interface describes the client side of an example service to request the details of a “bug” in an imaginary bug tracking system:-

 

interface IBugTrackerService

{

    Bug FetchBug(int id);

}

 

class BugTrackerProxy : IBugTrackerService
{

    public Bug FetchBug(int id)

    {

        . . .

    }

}

 

The client would use it like so:-

 

// somewhere in main()

var service = BugTrackerProxyFactory.Create();

. . .

// somewhere in the client processing

var bug = service.FetchBug(bugId);

 

What we need to do when passing the request over the wire is to tag our causality data on the end of the existing parameters. To achieve this we have a separate wire-level interface that “extends” the methods of the logical one:-

 

interface IRemoteBugTrackerService

{

    Bug FetchBug(int id, List<Tag> causality);

}

 

Then, inside the client proxy we can hoist the primary causality out of the diagnostic context container and pass it across the wire to the service’s remote stub:-

 

class BugTrackerProxy : IBugTrackerService
{

    public Bug FetchBug(int id)

    {

        var causality = Causality.GetPrimaryCausality();

 

        return m_remoteService.FetchBug(id, causality);

    }

 

    private IRemoteBugTrackerService m_remoteService;

}

 

We then need to do the reverse (inject the primary causality into the new call stack) inside the remote stub on the service side:-

 

class BugTrackerServiceImpl : IBugTrackerService

{

. . .

}

 

class RemoteBugTrackerService : IRemoteBugTrackerService
{

    public Bug FetchBug(int id, List<Tag> causality)

    {

        using (Causality.SetPrimaryCausality(causality))

        {

            return m_service.FetchBug(id);

        }

    }

 

    private BugTrackerServiceImpl m_service;

}

 

In this example the client proxy (BugTrackerProxy) and service stub (RemoteBugTrackerService) classes merely provide the mechanism for dealing with the non-functional data. Neither the caller nor the service implementation class (BugTrackerServiceImpl) are aware of what’s going on behind their backs.

 

In fact, as a double check that concerns are correctly separated, we should be able to invoke the real service implementation directly instead of the client proxy and still get the same primary causality appearing in our log output:-

 

//var service = BugTrackerProxyFactory.Create();

var service = new BugTrackerServiceImpl();

. . .

var bug = service.FetchBug(bugId);

Marshalling the Primary Causality to Other Threads

Marshalling the primary causality from one thread to another can be done in a similar manner to the remote case. The main difference is that you’ll likely already be using your language and runtime library in some way to hide some of the grunge, e.g. by using a delegate/lambda. You may need to give this up slightly and provide the proverbial “extra level of indirection” by wrapping the underlying infrastructure so that you can retrieve and inject the causality around the invocation of the business logic. Your calling code should still look fairly similar to before:-

 

Job.Run(() =>

{

    backgroundTask.Execute();

});

 

However, instead of directly using ThreadPool.QueueUserWorkItem we have another static façade (Job) that will marshal the causality behind the delegate’s back:-

 

public static class Job
{
    public static void Run(Action action)
    {
        var causality = Causality.GetPrimaryCausality();

        ThreadPool.QueueUserWorkItem((o) =>
        {
            using (Causality.SetPrimaryCausality(causality))
            {
                action();
            }
        });
    }
}

Marshalling Via Exceptions

In the previous two sections the marshalling was very much one-way because you want to unwind the diagnostic context as you return from each scope. But there is another way to marshal the causality, which is via exceptions. Just as an exception in .NET carries around a call stack from the point of the throw, along with any inner exceptions, it could also carry the causality. This allows you to avoid one common (anti) pattern, the “log and re-throw”:-

 

try
{
    // Read a file
}
catch (Exception e)
{
    Log.Error("Failed to read file '{0}'", filename);
    throw;
}

 

The only reason the try/catch block exists is to allow you to log some aspect of the current operation, because you know that once the call stack unwinds the context will be gone. However, if the exception captured the causality (or even the entire diagnostic context) in its constructor at the point of being thrown, this wouldn’t be necessary. You also wouldn’t have a “spurious” error message when the caller manages to completely recover from the exception using other means.

 

public class CustomException : Exception

{

    public CustomException(string message)

        : base(message)

    {

        m_causality = Causality.GetPrimaryCausality();

    }

 

    private List<Tag> m_causality;

}

 

Naturally this only works with your own exception classes, and so you might end up catching native exceptions anyway and re-throwing your own custom exception types just to capture the causality. However, you’ve avoided the potentially spurious log message, so it’s still a small net gain.
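
 

In code that wrapping might look something like this (a sketch only; the choice of IOException, and the message/inner-exception constructor overload on CustomException, are assumptions):-

 

try
{
    // Read a file
}
catch (IOException e)
{
    // Wrap the native exception purely so the causality is captured at the
    // point of failure; note that no logging happens here.
    throw new CustomException("Failed to read file", e);
}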

 

If the exception flows all the way back to the point where the transaction started you can then log the captured causality with the final exception message. In some cases this might be enough to allow you to diagnose the problem without having to find the local context where the action took place.
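
 

For example, the top-level handler could append the captured tags to the final error message; a sketch, assuming CustomException exposes what it captured via a (hypothetical) PrimaryCausality property:-

 

try
{
    ProcessRequest(request);
}
catch (CustomException e)
{
    // The causality travelled with the exception, so the context can still be
    // logged here even though the original scopes have long since unwound.
    var context = String.Join(";",
                    e.PrimaryCausality.Select(t => t.Key + "=" + t.Value));

    Log.Error("{0} [{1}]", e.Message, context);
}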

Tag Types

So far we’ve restricted ourselves to simple string based tags. But there is no reason why you couldn’t store references to the actual business objects and use runtime type identification (RTTI) to acquire an interface for handling causality serialization and formatting. If all you’re doing is rendering to a simple log file though this might be adding an extra responsibility to your domain types that you could do without.
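
 

By way of illustration, such an interface might look something like this (the interface name and its shape are hypothetical):-

 

// Hypothetical interface a domain type could implement so that the context
// can format an attached object reference when rendering a log message.
public interface ICausalityFormattable
{
    string FormatCausalityTag();
}

public class Customer : ICausalityFormattable
{
    public string Id { get; set; }

    public string FormatCausalityTag()
    {
        return "Customer=" + Id;
    }
}

public static class ObjectTagFormatter
{
    // At render time, RTTI (the 'as' operator here) discovers the interface.
    public static string Format(object attached)
    {
        var formattable = attached as ICausalityFormattable;

        return (formattable != null) ? formattable.FormatCausalityTag()
                                     : attached.ToString();
    }
}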

 

This is one area where I’ve found Extension Methods in C# particularly useful because they only rely on the public state of an object and you can keep them with the infrastructure side of the codebase. The calling code can then look like this:-

 

using (customer.TagCausality())

{

    // Do something with customer

}

 

The extension method can then hide the “magic” tag key string:-

 

public static class CustomerExtensions
{
    public static Tag TagCausality(this Customer customer)
    {
        return Causality.Attach("Customer", customer.Id);
    }
}

Keeping the Noise Down

Earlier I suggested that it’s worth differentiating between the primary and ancillary tags to keep the noise level down in your logs as you traverse the various layers within the system. This could be achieved either by keeping the primary tags in a separate container which is then merged during formatting, or by marking them with a special flag. The same suggestion applies to your context interface/façade – separate method names or an additional flag, e.g.

 

using (Causality.AttachPrimary("ID", Id))

…versus…

using (Causality.Attach("ID", Id, Causality.Primary))

…versus…

using (Causality.Attach("ID", Id))
using (Causality.MarkPrimary("ID"))

Whatever you decide, it will probably be the option that also helps keep the noise level down in your code. Just as we wanted to keep the marshalling logic away from our business logic, we might also choose to keep our diagnostic code separate too. If you’re using other tangential patterns, such as the Big Outer Try Block [Longshaw], or measuring everything you can afford to [Oldwood], you’ll find that weaving this aspect in as well might only succeed in burying the functional part even further:-

 

public void ProcessAdditionRequest(Request request)
{
    try
    {
        using (Causality.Attach("Request", request.Id))
        using (Causality.Attach("User", request.Login))
        using (Causality.Attach("Host", request.Host))
        {
            using (Instrument.MeasureElapsedTime("Addition"))
            {
                request.Answer = request.Left + request.Right;
            }
        }
    }
    catch (MyException e)
    {
        // Recover from known problem
    }
    catch (Exception e)
    {
        // Recover from unknown problem
    }
}

 

Most of the boilerplate code can be parcelled up into a helper method that takes a delegate/lambda so that the underlying functionality shines through again:-

 

public void ProcessAdditionRequest(Request request)

{

    HandleRequest(request, () =>

    {

        request.Answer = request.Left + request.Right;

    });

}

 

private void HandleRequest(Request request, Action action)
{
    try
    {
        using (Causality.Attach("Request", request.Id))
        {
            action();
        }
    }
    catch (MyException e)
    {
        // Recover from known problem
    }
    catch (Exception e)
    {
        // Recover from unknown problem
    }
}

Testing Causality Interactions

Due to the simple nature of the context implementation it is an orthogonal concern to any business logic you might be testing. As the example implementation shows, it is also entirely stateful, and so there are no mocking concerns unless you want to explicitly test that the context itself is being correctly manipulated. Given that the modus operandi of the diagnostic context is to allow you to extract the tags for your own use, the public API should already provide everything you need. This assumes of course that the operation you’re invoking provides you with a “seam” [Feathers] through which you can observe the effects, e.g.

 

public class TestService : IService
{
    public void DoSomething()
    {
        m_causality = Causality.GetPrimaryCausality();
    }

    public List<Tag> m_causality;
}

 

[Test]
public void RequestShouldSetPrimaryCausality()
{
    var service = new TestService();
    var request = new Request(service);

    request.ProcessIt();

    Assert.That(service.m_causality.Count, Is.EqualTo(1));
    Assert.That(service.m_causality[0].Key, Is.EqualTo("ID"));
    Assert.That(service.m_causality[0].Value, Is.EqualTo("1234"));
}

Summary

This article demonstrated a fairly unobtrusive way to associate arbitrary tags with a logical thread of execution to aid in the diagnosis and support of system issues via log files. It then illustrated a mechanism to pass the primary tags to other threads and remote processes so that multiple distributed scopes could be related to one another. The latter part contained some ideas on ways to reduce the noise in the client code and finished with a brief comment on what effects the mechanism has on unit testing.

Acknowledgements

A large dollop of thanks goes to Tim Barrass for providing me with the impetus to write this up in the first place and for forgoing sleep to review it too. Steve Love also picked up many problems both big and small. Finally the Overload team (Frances Buontempo, Roger Orr and Matthew Jones) helped to add some spit and polish.

References

[Ewald] Transactional COM+: Building Scalable Applications, Tim Ewald

[Feathers] Working Effectively With Legacy Code, Michael Feathers

[Harrison] Pattern Languages of Program Design 3, edited by Robert C. Martin, Dirk Riehle and Frank Buschmann.

[Longshaw] The Generation, Management and Handling of Errors (Part 2), Andy Longshaw and Eoin Woods, Overload 93

[Nygard] Release It!, Michael Nygard

[OED] Oxford English Dictionary

[Oldwood] Chris Oldwood, Instrument Everything You Can Afford To, http://chrisoldwood.blogspot.co.uk/2012/10/instrument-everything-you-can-afford-to.html

 

 

Chris Oldwood

06 March 2013

 

Bio

Chris started out as a bedroom coder in the 80s, writing assembler on 8-bit micros. These days it’s C++ and C# on Windows in big plush corporate offices. He is also the commentator for the Godmanchester Gala Day Duck Race and can be contacted via gort@cix.co.uk or @chrisoldwood.