Distributed tracing with Quarkus, Python and OpenTelemetry

This article was first published as a three-part series on https://pilhuhn.medium.com/

In a microservices world, there is no easy way of attaching a debugger or profiler to one component to understand why it exposes a certain behaviour. Profiling one component may reveal that the problem is actually someone else's. I have worked with OpenTracing in the past, but as it is now deprecated and OpenTelemetry is the new cool kid on the block, I wanted to have a look. The following diagram shows the overall setup for this post:

Part 1

Drawing of the overall scenario with 3 microservices

This setup is partially a real use case in the sense that I have a Quarkus-based application “backend”, which calls a Python server “fake-rbac”. To make things more interesting and for the sake of learning, I am also making an outbound call from the Python app to another Quarkus-based service “engine”. Let’s get started with the backend Quarkus app.

Adding OpenTelemetry to my Quarkus backend app

The Quarkus guide on OpenTelemetry is a good starting point. As I already had the basic code, I went ahead and added the following dependencies to my pom.xml file:

<dependency>
    <groupId>io.quarkus</groupId>
    <artifactId>quarkus-opentelemetry</artifactId>
</dependency>
<dependency>
    <groupId>io.quarkus</groupId>
    <artifactId>quarkus-opentelemetry-exporter-otlp</artifactId>
</dependency>
<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-extension-trace-propagators</artifactId>
</dependency>
<dependency>
    <groupId>io.opentelemetry.instrumentation</groupId>
    <artifactId>opentelemetry-jdbc</artifactId>
</dependency>
<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-extension-annotations</artifactId>
</dependency>

This looks like a lot, and I wish there were just one quarkus-opentelemetry-all extension that includes all of the above, but in the end it does not matter, as I had to add this once and am done with it.

The next step is to add some settings in application.properties and then we are ready to go. First we set the location of our trace collector, which will be Jaeger in its new OpenTelemetry Protocol (OTLP) native mode:

quarkus.application.name=backend
quarkus.opentelemetry.enabled=true
quarkus.opentelemetry.tracer.exporter.otlp.endpoint=http://127.0.0.1:4317

Next, change the database connection to support tracing as well:

quarkus.datasource.db-kind=postgresql
quarkus.datasource.jdbc.url=jdbc:otel:postgresql://127.0.0.1:5432/postgres
quarkus.datasource.jdbc.driver=io.opentelemetry.instrumentation.jdbc.OpenTelemetryDriver

Starting Jaeger

Jaeger supports OTLP natively since version 1.35. As I am using docker-compose on my laptop to start a bunch of services, I have the following entry, which enables OTLP and opens ports 4317/4318 (for gRPC and HTTP transport, respectively). Check out the Jaeger docs for more deployment options.

jaeger:
  image: jaegertracing/all-in-one:1.37.0
  ports:
    - 16686:16686
    - 4317:4317
    - 4318:4318
  environment:
    COLLECTOR_OTLP_ENABLED: true

When we now hit the REST endpoint on backend, go to the Jaeger UI on port 16686, search for traces and open the existing one, we will see a visualisation like the following:

Jaeger visualisation of a trace in a single process

We see that the total processing took 26ms and that two database queries were triggered by this. We also see that some RbacFilter was called; I’ll talk about that filter below.

Automatic instrumentation by the framework(s)

The interesting fact here is that without any code changes, we get the timing for the called method, including timings for the database calls. OpenTelemetry offers a large range of instrumentations for commonly used frameworks. Above we have pulled in the io.opentelemetry.instrumentation:opentelemetry-jdbc dependency and changed the JDBC URL slightly to include otel in it, and thus get the data for our database calls. In fact, when we click on the span for one of the calls to expand the details, we can even see which query was fired:

Database query details in the Jaeger UI

Automatic instrumentations exist not only for Quarkus (or Java in general), but for a large variety of programming languages and frameworks.

Now onto Python

In my project I need to call an RBAC (role-based access control) server that is written in Python. As I wanted to understand what it takes to also enable tracing in that server, I started to implement a simple HTTP server with the built-in Python http.server package (I am well aware that other servers exist and that even the OpenTelemetry Python Getting Started uses Flask).

I have decided to go with manual instrumentation despite the existence of opentelemetry-instrument, to make it more obvious which bits and pieces are involved, e.g. for instrumenting your own libraries.

I am not going to share the full code here; you can find it on GitHub. Instead I will focus on the core pieces.
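
For orientation, the handler snippets below plug into a plain http.server setup. A minimal sketch of such a server (the class name, port and response body here are made up for illustration; the real fake-rbac server is in the repository) could look like this:

from http.server import BaseHTTPRequestHandler, HTTPServer

class RbacHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The tracing code from the following snippets goes in here
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b'{"roles": []}')

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), RbacHandler).serve_forever()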

One of the first things we have to do is to get a tracer instance from the globally configured factory (we will come to that in a moment):

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

We can then use this in the server's handler method (do_GET):

def do_GET(self):
    with tracer.start_as_current_span("do-get-span") as span:
        self.send_response(200)
        span.set_attribute("type", "json")

Here we start a span (basically one of the brown bars in the above Jaeger visualisation) from the tracer and then set an attribute on the span. Processing then just continues normally and the span is ended once do_GET is finished.
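
If you need finer-grained timing inside the handler, you can also nest further spans under the current one and attach events to them. The following is purely illustrative; the "load-roles" span and the response body are not part of the original code:

def do_GET(self):
    with tracer.start_as_current_span("do-get-span") as span:
        span.set_attribute("type", "json")
        # A nested child span for a sub-operation (illustrative only)
        with tracer.start_as_current_span("load-roles") as child:
            child.add_event("loading roles")  # events show up as logs on the span
            roles = '{"roles": []}'           # placeholder for real work
        self.send_response(200)
        self.end_headers()
        self.wfile.write(roles.encode("utf-8"))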

Send the data to Jaeger

OpenTelemetry separates data acquisition via the OpenTelemetry API (which we have just seen) from the part that acts on the data, like exporting, batching or sampling, in the OpenTelemetry SDK. This means that with just the above, OpenTelemetry does not yet know what to do with the span and how to export it to Jaeger. So we have to configure the exporter before acquiring the data:

# Set up exporting
# First set the name of our service
resource = Resource(attributes={ SERVICE_NAME: "fake-rbac" })
# Configure the provider with the service name
provider = TracerProvider(resource=resource)
# Set up the OTLP exporter as processor
processor = BatchSpanProcessor(
            OTLPSpanExporter(endpoint="http://localhost:4317"))
# Tell OTEL to use this processor
provider.add_span_processor(processor)
# And finally tell the tracer to use this provider
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)
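
For completeness, the snippet above relies on a few imports from the OpenTelemetry SDK and the OTLP gRPC exporter. Assuming current opentelemetry-sdk and opentelemetry-exporter-otlp packages, they look roughly like this (the exact module layout may differ between releases):

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource, SERVICE_NAME
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter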

First we need to set the name of the service. It would work without, but Jaeger would just show unknown_service, which is not what we want. We then set up the exporter. In this case I have chosen the OTLP gRPC exporter. There is also an http/protobuf variant, which may be desirable in case you have issues with your gRPC library. For that variant you would configure the endpoint to use port 4318, e.g. http://localhost:4318/v1/traces.
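
As a rough sketch of that http/protobuf variant (assuming the opentelemetry-exporter-otlp-proto-http package is installed), only the exporter line changes:

from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Same BatchSpanProcessor as before, just with the http/protobuf exporter
processor = BatchSpanProcessor(
            OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))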

When I now use curl to access my service, I see a nice trace with one span in Jaeger:

Trace for our service in the overview

Let’s concatenate calls

Now let’s trigger the backend to make a call to the Python service and see what happens:

Trace that only shows spans from policies-backend

I see the trace from backend, but no mention of the fake-rbac Python service. The reason for this is that we haven’t yet told our Python code how to deal with the propagation of the trace id.

Part 2

Above we looked at instrumenting some Quarkus and Python code to obtain traces, which we forwarded to Jaeger. Unfortunately, the two were not connected and thus we could not get the full picture. Let’s fix that.

A trace with some embedded spans

Each trace as well as each span has a unique id. Spans additionally carry the traceId and the id of their parent span (you can imagine a trace as a span with no parent). In the picture above, Span 3 has Span 2 as its parent, and both Span 1 and Span 2 have the blue trace as their common parent. The blue trace itself does not have a parent.

This means that to link our Python code to the trace started by the backend, we need to tell it the trace id and also the id of the span that is supposed to be the parent span of the Python code. A common way to do this is to propagate this information in an HTTP header. There are several standards to do this, so a new standard was invented: the W3C Trace Context. Here, a single traceparent header is sent in HTTP requests. This header contains all the necessary information and looks like the following (refer to the standard for full information):

version(2) - traceId(32) - parentId(16) - flags(2)

The header value consists of four fields, separated by dashes. Each field is a hexadecimal number; the number in parentheses above is its number of digits. With that information we can now extract the header from the incoming request and create a SpanContext from it, which is then passed on when we create the span in the request handler:

inc_trace = self.headers["traceparent"]
ctx = {}
span_context = None
if inc_trace is not None:
    span_context = extract_trace_data(inc_trace)
    ctx = trace.set_span_in_context(NonRecordingSpan(span_context))
with tracer.start_as_current_span("my-span", context=ctx) as span:
    self.send_response(200)
    ...
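
The extract_trace_data helper is part of the code on GitHub; as a rough idea of what such a helper does, a minimal sketch without any error handling could look like this:

from opentelemetry.trace import SpanContext, TraceFlags

def extract_trace_data(traceparent):
    # traceparent format: version-traceId-parentId-flags, all hex encoded
    version, trace_id, parent_id, flags = traceparent.split("-")
    # Build a SpanContext that points at the remote parent span
    return SpanContext(
        trace_id=int(trace_id, 16),
        span_id=int(parent_id, 16),
        is_remote=True,
        trace_flags=TraceFlags(int(flags, 16)),
    )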

We first get the HTTP header and then pass it to the extractor code, which creates a SpanContext object that is then used when we start the new span in our handler method. With that change in place we can trigger another run on backend and indeed get the expected result in Jaeger:

Overview of a trace including fake-rbac and backend

In the detail view we can also clearly see the span for fake-rbac in orange:

Trace that now also includes fake-rbac

Propagation from Python to Quarkus

The previous success leaves us with the propagation out of Python into the second Quarkus service, as shown in the first drawing of the first post. And as we now know how we could do it by hand, I am taking a shortcut and using the requests library instrumentation. The setup in code is pretty simple: just add the following lines before setting up the TracerProvider and you are done.

from opentelemetry.instrumentation.requests import RequestsInstrumentor
RequestsInstrumentor().instrument()

Another request to backend yields this result:

More complex trace with involved services

We see all three services, but the overview shows two errors. What is going on? Let’s look at the details of the trace and expand the two spans with the red exclamation marks.

Trace view with expanded details

We can see in the trace view that the PUT request from fake-rbac to the engine resulted in a 400 ‘invalid request’ return code (marked in green). The instrumentation for the requests library therefore marks the span as an error.

Marking spans as errors

When we look at the span of the engine, which returned the 400 code, we can see that it is also marked as an error and that it contains a log message telling us why the request failed. For this to happen, the engine server code needs a bit of instrumentation:

} catch (Exception e) {
    Span span = Span.current();
    span.recordException(e);
    span.setStatus(StatusCode.ERROR);
    span.setAttribute("condition", condition);
}

This retrieves the current span and then records the exception. Recording the exception does not automatically mark the span as an error, so we explicitly do that as well. In the last step, the condition element that caused the exception is stored as an attribute so that it shows up as a tag on the span, which it does above with an empty value.
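
The Python API offers the same capability, by the way. A minimal sketch (do_work is just a placeholder, not from the original code):

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

try:
    do_work()  # placeholder for the actual request handling
except Exception as exc:
    span = trace.get_current_span()
    # Record the exception and explicitly mark the span as failed
    span.record_exception(exc)
    span.set_status(Status(StatusCode.ERROR))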

Part 3

The last two parts of this article focused on tracing basics and the instrumentation of Quarkus and Python code for HTTP requests. A very popular mechanism for data transfer is Apache Kafka. In this post, we’ll look at how to instrument code for use with Kafka and get trace propagation working. The general scenario is similar to the previous one, except that for the Quarkus part I use one server only:

Setup of this part

Quarkus is configured for tracing in pom.xml and application.properties as described in the first part of the series. The sender code to send data to topic1 then looks like the following (the messaging getting-started guide has more information):

@Inject
@Channel("topic1")
Emitter<String> emitter;

@GET
public String doSend() {
    emitter.send("Hello World");
    return "done";
}

When I call the endpoint via curl (the @GET annotation indicates that this is a REST endpoint), a message with a payload of “Hello World” is sent to topic1 on Kafka. When looking at the sent message via kafkacat, we can see the W3C trace propagation header, which we already discussed in the previous part:

$ kcat -C -t topic1 -b localhost:9092 -o beginning -J
{
  "topic":"topic1",
  "partition":0,
  "offset":0,
  "tstype":"create",
  "ts":1660559408201,
  "broker":0,
  "headers":[
    "traceparent","00-08b53bb8eb480e0abda9acb0ccb3e636-755009f86a258b5f-01"
  ],
  "key":null,
  "payload":"Hello World"
}

Note that kafkacat only shows the payload unless a formatting option like -J is provided.

Now onto the Python code

Our Python code to start with is pretty simple. We start a consumer on topic1 and then print incoming messages:

from kafka import KafkaConsumer
consumer = KafkaConsumer('topic1')
for msg in consumer:
    print(msg)

Running the code will print a message on the console like the following, where we can see the passed traceparent header and the body of the message:

ConsumerRecord(topic='topic1',
               partition=0,
               offset=9,
               timestamp=1660563030653,
               value=b'Hello World',
               headers=[      ('traceparent', b'00-8a6a62d94e611819d4f7a3523d95542b-5fd53b80e46d4ffa-01')  ]  ...   )

We could now extract the header and create the SpanContext as seen in the HTTP example, but this time we take the easy route and use the existing Kafka instrumentation for OpenTelemetry:

from opentelemetry.instrumentation.kafka import KafkaInstrumentor

KafkaInstrumentor().instrument()

This sets up all the necessary magic, so that our code can be pretty crisp:

from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer('topic1')
producer = KafkaProducer()
for msg in consumer:
    with tracer.start_as_current_span("do-the-work-span") as span:
        # do the work
        body = msg.value.decode('utf-8')
        body = body + ' from Python'
        # and send it off to topic2
        producer.send('topic2', body.encode('utf-8'))

Triggering a Kafka message from Quarkus (with the Quarkus receiver already in place), we get the following display in Jaeger:

Jaeger showing 2 traces

Wait, this is not what we wanted. We see “enough” data, but those traces should be one. Some looking around and debugging indicates that the Kafka instrumentation does not (yet, as of v1.11) propagate the received traceId as the parent of our span.

There is help

With the traceparent in the header and the knowledge from Part 2 of the series, we can easily fix this:

# look for traceparent header and return its value
trace_parent = get_trace_parent_header(msg)
# Create a SpanContext object from the header value
span_context = extract_trace_data(trace_parent)
# Use this SpanContext as parent
ctx = trace.set_span_in_context(NonRecordingSpan(span_context))
with tracer.start_as_current_span("do-the-work-span", context=ctx) as span:
  # do the work ...
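
Both helpers are in the repository; extract_trace_data is the same one as in Part 2. As a rough idea, a minimal sketch of get_trace_parent_header could look like this:

def get_trace_parent_header(msg):
    # Kafka headers are a list of (key, value) tuples with bytes values
    for key, value in msg.headers:
        if key == "traceparent":
            return value.decode("utf-8")
    return None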

And now it looks like it is supposed to be:

Full trace of tracing a message over Kafka

The Quarkus receiver

The last missing piece is the receiving side in Quarkus. And it is as simple as:

@Incoming("topic2")
void process(String message) {
    System.out.println("Got a message: " + message);
}

Yep, that’s it. The instrumentation is doing everything for us.

Resend the message within Quarkus

Like in the Python case, I now wanted to forward the incoming message again. The setup stays largely the same, but within the Quarkus process there is now a check on the received message, which is then eventually forwarded to k-serv again:

New slightly changed setup

When the trigger is hit, message (1) is sent to k-serv, which transforms it and sends message (2) back. Q-serv then checks if it has seen it before and, if not, forwards it as message (3) again to k-serv, which then transforms and re-sends it again (4).

As in the Python example, we want the forwarding in Quarkus to also propagate the trace-information to get a full picture.

Unfortunately this propagation is not yet automatic (as of Quarkus 2.11), so we need to do a little work ourselves. First we need to change the signature of the method:

@Incoming("topic2")CompletionStage<Void> 
process(Message<String> message) {

This allows us to retrieve the headers and thus the traceparent:

Optional<TracingMetadata> optionalTracingMetadata = TracingMetadata.fromMessage(message);
if (optionalTracingMetadata.isPresent()) {
    TracingMetadata tracingMetadata = optionalTracingMetadata.get();

We then use this to set the current context for the unit of work. Using a try-with-resources block ensures that the scope is closed at the end, which is required by the Context API.

try (Scope _ignored =
           tracingMetadata.getCurrentContext().makeCurrent()) {

If the message should be re-sent (number (3) above), we now create a new Message, attach the tracing info to it and send it off:

// Get the body
String body = message.getPayload();
// We need to use a Message to emit with headers
Message<String> out = Message.of(body);
// Add the tracing metadata to the outgoing message header
out = out.addMetadata(
        TracingMetadata.withCurrent(Context.current()));
// And send to round 2
emitter.send(out);

With these changes we can nicely see how the message is processed:

Screenshot from Jaeger UI showing the processing of the message

Show me the code

Both the Python and Quarkus code live on GitHub. The extended use case of Quarkus forwarding data lives in the forwarding branch.

Thanks go to Bruno Baptista for feedback on this article.