2

I'm doing system design interview prep and was practicing with a "designing WhatsApp" question. One of the things I'm still unclear on is how a system like WhatsApp maintains order in message delivery. I'll give my high-level understanding:

Message delivery needs to be causal but doesn't need to be linearizable. That is, concurrent messages (where concurrency is defined as messages between "typing" and "ack'd by server") are incomparable and can be delivered in any order.

Ordering/causality can apply to intra-group messages (the most prominent case) or inter-group messages (e.g., if two users are part of two different shared chats with other users).

Most designs I've seen either don't explicitly address ordered message delivery or address it with a per-user message queue (e.g., https://www.youtube.com/watch?v=WzBzYX1aSrU&t=0s). This per-user queue could be either be an on-server queue (e.g., RabbitMQ) or something like a FIFO SQS queue with user_id as the message group id (though the SQS approach might not scale given FIFO SQS constraints). However, it's not clear to me how a per-user queue completely solves the message ordering question. A per-user queue ensures ordered delivery of messages but does nothing to ensure that messages are placed on a given user's queue in the correct order.

There are two main ways I can think of to avoid this. One is to assign message workers on a per-group basis. Then we have an architecture where for each user, there is a single gateway assigned to handle that user/client's messages (send and receive - Gaurav Sen discusses this approach in https://www.youtube.com/watch?v=vvhC64hQZMk). For each group, there is a single message processor assigned to handle that group's messages. In this way, the message processor will push a given message to each user's gateway queue (or to temp storage if that user is offline) before it processes another message.

The other way to address this would be a total order broadcast (though perhaps some sort of total order multicast could be used). However, I don't necessarily see how this would scale, as it's not clear that multicast would work (and broadcast doesn't seem like it would scale).

TLDR; when designing a messaging app, it seems that ordered message delivery defines a partial order that requires both a per-user queue (for delivering messages in order) and a per-group queue (for putting messages on each recipient's queue in the same order).

Adrian K
  • 9,880
  • 3
  • 33
  • 59
rbapq
  • 21
  • 1
  • 3

1 Answers1

2

I have no idea how WhatsApp works, but regarding your question:

One of the things I'm still unclear on is how a system like WhatsApp maintains order in message delivery

One approach, which you discuss, is queues and message management.

Another is time. Using timestamps (e.g. UTC) a system can see how messages relate in terms of time (e.g. when they were first arrived at the backend), and therefore how they should be displayed, based on the timestamp. Imagine clients pull whatever messages exist for them in batches, you don't need to worry about getting the order right because the consumer can just work off the common timestamp.

Remember that (unless they use some custom networking) apps will be using HTTPS over the internet; maintaining a continuious connection (like a WebSocket) is possible but I would assume most use follow HTTP, which is request/response based - meaning that the client askes the server for information - the server can't push it out unrequested.

Going back to timestamps, in such an approach, the backend / server-systems can give the client software any and all messages they have at the time, and they know they don't need to worry about dealing with message order because the overall system design delegates that responsibility to the client, based on the timestamp. A potential advantage of such an approach is simplicity - removing the need to manage message ordering, and the complexity that goes with that.

Regarding this:

However, it's not clear to me how a per-user queue completely solves the message ordering question. A per-user queue ensures ordered delivery of messages but does nothing to ensure that messages are placed on a given user's queue in the correct order.

It's entirely possible that the user queue is all about delivery, and nothing to do with correct ordering. As a general rule, parts of a system should only have one job, which they do really well. This usually makes for a better design: easier to maintain, and more reliable to operate.

Adrian K
  • 9,880
  • 3
  • 33
  • 59