
I am just starting with OpenTelemetry and have created two (micro)services for this purpose: Standard and GeoMap.

The end-user sends requests to the Standard service, which in turn sends requests to GeoMap to fetch information before returning the result to the end-user. I am using gRPC for all communications.

I have instrumented my functions as such:

For Standard:

type standardService struct {
    pb.UnimplementedStandardServiceServer
}

func (s *standardService) GetStandard(ctx context.Context, in *pb.GetStandardRequest) (*pb.GetStandardResponse, error) {

    conn, _ := createClient(ctx, geomapSvcAddr)
    defer conn.Close()

    newCtx, span1 := otel.Tracer(name).Start(ctx, "GetStandard")
    defer span1.End()

    countryInfo, err := pb.NewGeoMapServiceClient(conn).GetCountry(newCtx,
        &pb.GetCountryRequest{
            Name: in.Name,
        })

    //...

    return &pb.GetStandardResponse{
        Standard: standard,
    }, nil

}

func createClient(ctx context.Context, svcAddr string) (*grpc.ClientConn, error) {
    return grpc.DialContext(ctx, svcAddr,
        grpc.WithTransportCredentials(insecure.NewCredentials()),
        grpc.WithUnaryInterceptor(otelgrpc.UnaryClientInterceptor()),
    )
}

For GeoMap:

type geomapService struct {
    pb.UnimplementedGeoMapServiceServer
}

func (s *geomapService) GetCountry(ctx context.Context, in *pb.GetCountryRequest) (*pb.GetCountryResponse, error) {

    _, span := otel.Tracer(name).Start(ctx, "GetCountry")
    defer span.End()

    span.SetAttributes(attribute.String("country", in.Name))

    span.AddEvent("Retrieving country info")

    //...
    
    span.AddEvent("Country info retrieved")

    return &pb.GetCountryResponse{
        Country: &country,
    }, nil

}

Both services are configured to send their spans to a Jaeger backend and share an almost identical main function (small differences are noted in comments):

const (
    name        = "mapedia"
    service     = "geomap" //or standard
    environment = "production"
    id          = 1
)

func tracerProvider(url string) (*tracesdk.TracerProvider, error) {
    // Create the Jaeger exporter
    exp, err := jaeger.New(jaeger.WithCollectorEndpoint(jaeger.WithEndpoint(url)))
    if err != nil {
        return nil, err
    }
    tp := tracesdk.NewTracerProvider(
        // Always be sure to batch in production.
        tracesdk.WithBatcher(exp),
        // Record information about this application in a Resource.
        tracesdk.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceName(service),
            attribute.String("environment", environment),
            attribute.Int64("ID", id),
        )),
    )
    return tp, nil
}

func main() {

    tp, err := tracerProvider("http://localhost:14268/api/traces")
    if err != nil {
        log.Fatal(err)
    }

    defer func() {
        if err := tp.Shutdown(context.Background()); err != nil {
            log.Fatal(err)
        }
    }()
    otel.SetTracerProvider(tp)

    listener, err := net.Listen("tcp", ":"+port)
    if err != nil {
        panic(err)
    }

    s := grpc.NewServer(
        grpc.UnaryInterceptor(otelgrpc.UnaryServerInterceptor()),
    )
    reflection.Register(s)
    pb.RegisterGeoMapServiceServer(s, &geomapService{}) // or pb.RegisterStandardServiceServer(s, &standardService{})
    if err := s.Serve(listener); err != nil {
        log.Fatalf("Failed to serve: %v", err)
    }
}

When I look at a trace generated by an end-user request to the Standard Service, I can see that it is, as expected, making calls to its GeoMap service:

(Screenshot: Standard trace)

However, I don't see any of the attributes or the events I have added to the child span (I added an attribute and 2 events when instrumenting the GetCountry function of GeoMap).

What I notice however is that these attributes are available in another separate trace (available under the "geomap" service in Jaeger) with a span ID totally unrelated to the child spans in the Standard service:

(Screenshot: geomap trace)

What I expected instead is a single trace, with all attributes and events related to GeoMap visible in the child span within the Standard span. How can I get to the expected result from here?

Mit94

1 Answer


The span context (which contains trace ID and span ID, as described in "Service Instrumentation & Terminology") should be propagated from the parent span to the child span in order for them to be part of the same trace.

With OpenTelemetry, this is often done automatically by instrumenting your code with the provided plugins for various libraries, including gRPC.
However, the propagation does not seem to be working correctly in your case.

In your code, you are starting a new span in the GetStandard function, and then using that context (newCtx) when making the GetCountry request. That is correct, as the new context should contain the span context of the parent span (GetStandard).
But the issue might be related to your createClient function:

func createClient(ctx context.Context, svcAddr string) (*grpc.ClientConn, error) {
    return grpc.DialContext(ctx, svcAddr,
        grpc.WithTransportCredentials(insecure.NewCredentials()),
        grpc.WithUnaryInterceptor(otelgrpc.UnaryClientInterceptor()),
    )
}

You are correctly using otelgrpc.UnaryClientInterceptor here, which should ensure that the context is propagated correctly. However, in your GetStandard function, createClient is called with the incoming ctx, before the GetStandard span is started, so the context used to create the client does not include the span context from GetStandard.

For testing, try and make sure that the client is created after the GetStandard function is invoked, and the same context is used throughout the request.

You can do this by starting the span first and passing newCtx to both createClient and GetCountry, as illustrated with this modified version of your GetStandard function:

func (s *standardService) GetStandard(ctx context.Context, in *pb.GetStandardRequest) (*pb.GetStandardResponse, error) {
    newCtx, span1 := otel.Tracer(name).Start(ctx, "GetStandard")
    defer span1.End()

    conn, _ := createClient(newCtx, geomapSvcAddr)
    defer conn.Close()

    countryInfo, err := pb.NewGeoMapServiceClient(conn).GetCountry(newCtx,
        &pb.GetCountryRequest{
            Name: in.Name,
        })

    //...

    return &pb.GetStandardResponse{
        Standard: standard,
    }, nil
}

Now, the context used to create the client and make the GetCountry request will include the span context from GetStandard, and they should appear as part of the same trace in Jaeger.

(As always, check the errors returned by functions like createClient and GetCountry; the checks are omitted here for brevity.)


In addition:

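  • Check your propagator: make sure both services use the same context propagator, preferably the W3C TraceContext propagator (propagation.TraceContext). Note that the OpenTelemetry Go SDK does not install it for you: until otel.SetTextMapPropagator is called, the global propagator is a no-op, so no trace context is injected into or extracted from the request metadata.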

    You can set the propagator explicitly as follows:

    otel.SetTextMapPropagator(propagation.TraceContext{})
    

    Add the above line to the beginning of your main function in both services.

  • Ensure metadata is being passed: The gRPC interceptor should automatically inject/extract the tracing context from the metadata of the request, but double-check to make sure it is working properly.

    After starting a span in your GetCountry function, you can log the trace ID and span ID:

    ctx, span := otel.Tracer(name).Start(ctx, "GetCountry")
    sc := trace.SpanContextFromContext(ctx)
    log.Printf("Trace ID: %s, Span ID: %s", sc.TraceID(), sc.SpanID())
    defer span.End()
    

    And do the same in your GetStandard function:

    newCtx, span1 := otel.Tracer(name).Start(ctx, "GetStandard")
    sc := trace.SpanContextFromContext(newCtx)
    log.Printf("Trace ID: %s, Span ID: %s", sc.TraceID(), sc.SpanID())
    defer span1.End()
    

    The trace IDs in the two services should match if the context is being propagated correctly.
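
To make the effect of the propagator more concrete, here is a minimal, standard-library-only sketch of the inject/extract round trip. It is not the OpenTelemetry implementation: the spanContext type, the propagator interface, both propagator types and the serverTraceID helper are hypothetical names invented for this illustration, and the map stands in for the gRPC request metadata. The only part taken from the W3C Trace Context format is the traceparent layout (version-traceid-parentid-flags). It shows why a no-op propagator leaves the callee with a fresh trace ID, while a trace-context propagator lets it continue the caller's trace.

```go
package main

import (
	"fmt"
	"strings"
)

// spanContext is a simplified, hypothetical stand-in for a span context.
type spanContext struct {
	TraceID string // 32 hex characters
	SpanID  string // 16 hex characters
	Sampled bool
}

// propagator mirrors the inject/extract idea of a text-map propagator.
type propagator interface {
	Inject(sc spanContext, md map[string]string)
	Extract(md map[string]string) (spanContext, bool)
}

// noopPropagator writes nothing and reads nothing.
type noopPropagator struct{}

func (noopPropagator) Inject(spanContext, map[string]string) {}

func (noopPropagator) Extract(map[string]string) (spanContext, bool) {
	return spanContext{}, false
}

// traceContextPropagator uses the W3C "traceparent" layout:
// version-traceid-parentid-flags.
type traceContextPropagator struct{}

func (traceContextPropagator) Inject(sc spanContext, md map[string]string) {
	flags := "00"
	if sc.Sampled {
		flags = "01"
	}
	md["traceparent"] = fmt.Sprintf("00-%s-%s-%s", sc.TraceID, sc.SpanID, flags)
}

func (traceContextPropagator) Extract(md map[string]string) (spanContext, bool) {
	parts := strings.Split(md["traceparent"], "-")
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 || len(parts[3]) != 2 {
		return spanContext{}, false
	}
	// Simplification: only the exact flags value "01" is treated as sampled.
	return spanContext{TraceID: parts[1], SpanID: parts[2], Sampled: parts[3] == "01"}, true
}

// serverTraceID simulates one call: the client injects the parent span
// context into the metadata, the server extracts it. If nothing can be
// extracted, the server falls back to a fresh trace ID (a new trace).
func serverTraceID(p propagator, parent spanContext, freshTraceID string) string {
	md := map[string]string{} // stands in for gRPC request metadata
	p.Inject(parent, md)      // client side
	if sc, ok := p.Extract(md); ok { // server side
		return sc.TraceID
	}
	return freshTraceID
}

func main() {
	parent := spanContext{
		TraceID: "4bf92f3577b34da6a3ce929d0e0e4736",
		SpanID:  "00f067aa0ba902b7",
		Sampled: true,
	}
	fresh := "0af7651916cd43dd8448eb211c80319c"

	fmt.Println("caller trace ID:      ", parent.TraceID)
	fmt.Println("callee (no-op):       ", serverTraceID(noopPropagator{}, parent, fresh))
	fmt.Println("callee (tracecontext):", serverTraceID(traceContextPropagator{}, parent, fresh))
}
```

In the real services, the otelgrpc interceptors play the role of the Inject and Extract calls, using whatever propagator was registered through otel.SetTextMapPropagator.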

VonC
  • Thank you but unfortunately I am still getting two different traces, and the attributes set on the _GeoMap_ span are still not shown on the _Standard_ Trace. – Mit94 Jul 05 '23 at 12:34
  • @Mit94 OK. I have added some checks for you to test and validate. – VonC Jul 05 '23 at 12:56
  • Yes setting explicitly the propagator to `otel.SetTextMapPropagator(propagation.TraceContext{})` did it for me! Before trace IDs were different but now they match in both services. I wonder how this is not default behaviour. Thank you! – Mit94 Jul 05 '23 at 13:13
  • 1
    @Mit94 Good catch, well done! Note: the environment variable [`OTEL_PROPAGATORS`](https://opentelemetry.io/docs/concepts/sdk-configuration/general-sdk-configuration/#otel_propagators) can be used to specify the propagators, and if this was set to something else (e.g., `b3multi`), it would have overridden the default. So while it should be the default behavior, in some cases you might still need to manually set it, like you did. It is always a good idea to specify such important configurations explicitly to avoid potential issues like this. – VonC Jul 05 '23 at 13:19