API design for fire and forget endpoints

Question

I’m currently maintaining a few HTTP APIs based on the standard library and gorilla mux and running in kubernetes (GKE).

We’ve adopted the http.TimeoutHandler as our “standard” way to have a consistent timeout error management. A typical endpoint implementation will use the following “chain”:

MonitoringMiddleware => TimeoutMiddleware => … => handler

so that we can monitor a few key metrics per endpoint.

One of our API is typically used in a “fire and forget” mode meaning that clients will push some data and not care for the API response. We are facing the issue that

the Golang standard HTTP server will cancel a request context when the client connection is no longer active (godoc)
the TimeoutHandler will return a “timeout” response whenever the request context is done (see code)

This means that we are not processing requests to completion when the client disconnects which is not what we want and I’m therefore looking for solutions.

The only discussion I could find that somewhat relates to my issue is https://github.com/golang/go/issues/18527; however

The workaround is your application can ignore the Handler's Request.Context()

would mean that the monitoring middleware would not report the "proper" status since the Handler would perform the request processing in its goroutine but the TimeoutHandler would be enforcing the status and observability would be broken.

For now, I’m not considering removing our middlewares as they’re helpful to have consistency across our APIs both in terms of behaviours and observability. My conclusion so far is that I need to “fork” the TimeoutHandler and use a custom context for when an handler should not depend on the client waiting for the response or not.

The gist of my current idea is to have:

type TimeoutHandler struct {
    handler Handler
    body    string
    dt      time.Duration

    // BaseContext optionally specifies a function that returns
    // the base context for controling if the server request processing.
    // If BaseContext is nil, the default is req.Context().
    // If non-nil, it must return a non-nil context.
    BaseContext func(*http.Request) context.Context
}

func (h *TimeoutHandler) ServeHTTP(w ResponseWriter, r *Request) {
    reqCtx := r.Context()
    if h.BaseContext != nil {
        reqCtx = h.BaseContext(r)
    }
    
    ctx, cancelCtx := context.WithTimeout(reqCtx, h.dt)
    defer cancelCtx()
    
    r = r.WithContext(ctx)
   ...
    case <-reqCtx.Done():
        tw.mu.Lock()
        defer tw.mu.Unlock()
        w.WriteHeader(499) // write status for monitoring;
        // no need to write a body since no client is listening.
    
    case <-ctx.Done():
        tw.mu.Lock()
        defer tw.mu.Unlock()
        w.WriteHeader(StatusServiceUnavailable)
        io.WriteString(w, h.errorBody())
        tw.timedOut = true
    }

The middleware BaseContext callback would return context.Background() for requests to the “fire and forget” endpoint.

One thing I don’t like is that in doing so I’m losing any context keys written so this new middleware would have strong usage constraints. Overall I feel like this is more complex than it should be.

Am I completely missing something obvious? Any feedback on API instrumentation (maybe our middlewares are an antipattern) /fire and forget implementations would be welcomed!

EDIT: as most comments are that a request for which the client does not wait for a response has unspecified behavior, I checked for more information on typical clients for which this happens.

From our logs, this happens for user agents that seem to be mobile devices. I can imagine that connections can be much more unstable and the problem will likely not disappear. I would therefore not conclude that I shouldn't find a solution since this is currently creating false-positive alerts.

The underlying problem here is the client which relies on unspecified behavior. Is it possible to attack that problem instead? — JimB, Nov 26 '21 at 14:08
I agree that there is a client-side issue however realistically the endpoint is used in contexts that I cannot control at all (SaaS product). — marchelbling, Nov 26 '21 at 14:14
To extend on why I would rather not just rely on clients: with the `TimeoutHandler` behavior, my monitoring will report 503 spikes whenever a bunch of clients disconnect abruptly so I end up with (false positive) alerts that are a pain. — marchelbling, Nov 26 '21 at 14:22
This is a very opinion based question, but I would just add another middleware to the handlers which are asynchronous that: 1) Writes a 202 response to the client. 2) Calls the handler in a new goroutine by passing it the request but with the context wrapped in a [WithoutCancel](https://stackoverflow.com/a/54132324/1440786). 3) Returns from the middleware. — Bracken, Nov 26 '21 at 15:55
"clients will push some data and not care for the API response" this is just plain wrong. The client doesn't know the request was successfully received without a response. Fire and forget usually means you don't wait for an *eventual* result, not that you don't get an *immediate* response. Are you processing requests you should be receiving and queuing? — erik258, Nov 26 '21 at 16:36
@Daniel Farrell: for analytics libraries, I think that the client environment doesn't care for its data being fully received. To answer your question, currently processing is mostly auth{n,z} (which may require an outgoing HTTP request), unmarshaling/light validation and queueing. — marchelbling, Nov 26 '21 at 16:46
@Bracken I think that eventually we'll move to something more "async" as you propose (one possible issue would be that guaranteeing no data loss would be harder e.g. when a pod is removed). The issue I would have now is that some clients would retry on error because the API supports both frontend and backend implementations as well as batching. Then maybe this API is supporting too many scenarii but for now it's not something that I can change. — marchelbling, Nov 26 '21 at 16:53

API design for fire and forget endpoints

0 Answers0