Removing errors from Datadog traces

5 Feb 2022

Problem

I was writing a library that would serialize objects into JSON so that they could automatically be injected into a Python LogRecord. This was pretty straightforward if the object you were dealing with was a native data type, but if the object in question was a Django queryset or a foreign key on a model instance then innocent users of my library could suddenly end up making unintended database queries. This seemed like a bad idea, so I needed to find a way to prevent accidental database calls from happening.

The Django docs recommend doing this by installing a wrapper function around your database queries and raising an exception whenever you need to prevent database access. This sounded perfect, and I was able to base a solution on the excellent django-zen-queries package.
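For illustration, here's roughly what that approach looks like – this is a simplified sketch of the idea rather than the actual django-zen-queries code, and the function names are made up – using Django's connection.execute_wrapper to install a wrapper that raises instead of executing the query:

from django.db import connection


class DatabaseAccessForbiddenException(Exception):
    pass


def _forbid_queries(execute, sql, params, many, context):
    # Django calls this wrapper instead of running the query, so raising here
    # blocks any database access attempted inside the block below.
    raise DatabaseAccessForbiddenException(f"Database access is forbidden here: {sql}")


# Any ORM call made inside this block raises instead of silently hitting the database.
with connection.execute_wrapper(_forbid_queries):
    serialize_things()  # hypothetical code that should never touch the database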

Everything was running as intended until my database monitoring exploded, alerting me to a massive increase in errors! I was glad I’d put that monitoring in place! We use Datadog at work, and their application monitoring was, quite understandably, treating my DatabaseAccessForbiddenException exactly the same as any other error. I couldn’t proceed with my work if it was going to compromise the database monitoring of my application, and that of any application using my library, so it was time to go digging.

Solution

Whenever a request is made to your service, Datadog collects a bunch of information, bundles it into what’s called a “trace”, and sends it to their servers for processing. Once the trace is processed you can see how long it took in total and drill down into the individual “spans” in the trace – one for every function call, database query and so on – which is very useful! There didn’t seem to be a way to configure the Datadog application monitoring itself to ignore a specific error once a trace had been ingested, though, so I figured I needed a way to prevent it from being sent in the first place. The documentation for ddtrace, the Python package that provides the tracer that creates the traces, came to my rescue with trace filtering:

It is possible to filter or modify traces before they are sent to the Agent by configuring the tracer with a filters list… The filters in the filters list will be applied sequentially to each trace

Wherever you call tracer.configure you can pass an optional settings keyword argument: a dictionary of key-value pairs that can include a FILTERS entry, a list of TraceFilter instances. This meant I could define my own filter, and in its process_trace method I could take the trace (which is really only a list of Span instances), iterate over the spans, find the ones with this specific error in them and omit them from the trace entirely! A little spelunking in the ddtrace code itself pointed me to Span.error – a toggle denoting whether or not there is an error in a span – and Span.meta, which contains information about the actual error associated with the span, if any. Job done, right?

def process_trace(self, trace: List[Span]) -> List[Span]:
    return [
        span for span in trace
        if not (span.error and span.get_tag('error.type') in ERRORS_WE_CARE_ABOUT)
    ]
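For completeness, here’s the whole thing wired together – treat it as a sketch rather than gospel, since import paths can move around between ddtrace versions, and the class name and the contents of ERRORS_WE_CARE_ABOUT are placeholders:

from typing import List, Optional

from ddtrace import Span, tracer
from ddtrace.filters import TraceFilter

# Placeholder: whatever values show up as the error.type tag on the spans you want to drop.
ERRORS_WE_CARE_ABOUT = {'myapp.exceptions.DatabaseAccessForbiddenException'}


class IgnoreForbiddenQueryErrors(TraceFilter):
    def process_trace(self, trace: List[Span]) -> Optional[List[Span]]:
        return [
            span for span in trace
            if not (span.error and span.get_tag('error.type') in ERRORS_WE_CARE_ABOUT)
        ]


# Register the filter so it runs over every trace before it is sent to the Agent.
tracer.configure(settings={'FILTERS': [IgnoreForbiddenQueryErrors()]})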

Trade-offs

While this does indeed accomplish the goal of removing specific errors from Datadog traces, it’s not without its complications. First off is that treating a trace as a list of spans is fine until you realise that spans can be nested, as the OpenTelemetry docs describe:

Causal relationships between Spans in a single Trace

        [Span A]  ←←←(the root span)
            |
     +------+------+
     |             |
 [Span B]      [Span C] ←←←(Span C is a `child` of Span A)
     |             |
 [Span D]      +---+-------+
               |           |
           [Span E]    [Span F]
Temporal relationships between Spans in a single Trace

––|–––––––|–––––––|–––––––|–––––––|–––––––|–––––––|–––––––|–> time

 [Span A···················································]
   [Span B··············································]
      [Span D··········································]
    [Span C········································]
         [Span E·······]        [Span F··]

So if you omit a span from a trace you may inadvertently orphan a span that is a child of the removed span. This is simple to address in your TraceFilter by maintaining a list of IDs for the removed spans and checking each span to see whether its parent_id is in that list, in which case you omit the child as well (and add its ID to the list) – see the sketch below. You do potentially run the risk of removing the root span, though, and trying to send a completely empty trace to your telemetry aggregator, which could either cause an error or confuse someone looking at a monitoring UI. In my use case this seemed unlikely, however.
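Here’s a sketch of that parent-aware version. It reuses the names from the earlier snippet and assumes parent spans appear in the list before their children – if they don’t, you’d need a second pass:

class DropForbiddenQuerySpans(TraceFilter):
    def process_trace(self, trace: List[Span]) -> Optional[List[Span]]:
        removed_span_ids = set()
        kept = []
        for span in trace:
            has_forbidden_error = span.error and span.get_tag('error.type') in ERRORS_WE_CARE_ABOUT
            if has_forbidden_error or span.parent_id in removed_span_ids:
                # Drop this span, and remember its ID so any of its children get dropped too.
                removed_span_ids.add(span.span_id)
            else:
                kept.append(span)
        # If everything (including the root span) got removed, returning None should
        # drop the trace entirely rather than sending an empty one.
        return kept or None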

This leads to the second problem: even if you remove these spans from the trace, they are still reflected in the overall time for the trace itself, and removing a bunch of spans may confuse people looking at the trace and trying to work out what’s going on. You could leave the spans in place but remove the errors from them (sketched below), but that potentially causes a different issue: if the spans are essentially failures then they are likely to have very short durations, which might throw off any aggregate metrics you have around certain database queries by making them look more performant than they really are. You could try to alter the timestamps of all of the spans in the trace to completely remove their impact on the overall duration, but that seemed like overkill for my needs. In my use case the spans in question had durations measured in microseconds and I wasn’t removing too many of them, but your mileage may vary.
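For what it’s worth, the leave-them-in-place variant is even simpler – this sketch just flips the error toggle and leaves the timing (and the error details in Span.meta) untouched:

class ScrubForbiddenQueryErrors(TraceFilter):
    def process_trace(self, trace: List[Span]) -> Optional[List[Span]]:
        for span in trace:
            if span.error and span.get_tag('error.type') in ERRORS_WE_CARE_ABOUT:
                # Keep the span for timing purposes, but stop it counting as an error.
                span.error = 0
        return trace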

Finally, it looks like there’s some risk that this workaround may be removed or complicated in the future: there’s a TODO in the library suggesting that the maintainers would like to deprecate Tracer.configure. The TODO was added in 2019 and is still there at the beginning of 2022, but that’s not a cast-iron guarantee that things won’t change – you have been warned!