In short: yes. We have done it, and it works pretty well.
The trick is that you have to think beyond just the API and worker applications when it comes to horizontal scaling. If you want a push architecture, it needs to be asynchronous from the very beginning.
To achieve this, we used a queueing system, namely RabbitMQ.
Imagine this scenario for report generation, which can take up to 10 minutes (rough code sketches follow the list):
- Client connects to our GraphQL API (instance 1) via WebSocket
- Client sends a command to generate a report via WebSocket
- API generates a token for the command, puts the report-generation command on the CommandQueue (in RabbitMQ), and returns the token to the Client.
- Client subscribes to its command's result events, using the token
- Some backend Worker picks up the command and executes the report generation procedure
- During this time GraphQL API (instance 1) dies
- Client automatically reconnects to GraphQL API (instance 2)
- Client renews the subscription with the previously acquired token
- The Worker finishes and publishes the results to the EventsQueue (RabbitMQ)
- ALL of our GraphQL instances receive the ReportGenerationDoneEvent and check if anybody is listening for its token.
- GraphQL API (instance 2) sees that the Client is awaiting the results and pushes them via WebSocket.
- GraphQL API (instances 3-100) ignore the ReportGenerationDoneEvent.
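To make the flow above concrete, here is a minimal sketch of the API side in TypeScript, assuming a Node stack with amqplib and the graphql-subscriptions package (v2 API). The names (`CommandQueue`, `REPORT_DONE`, `generateReport`) and the schema shape are illustrative assumptions, not the original implementation:

```typescript
// Sketch only: a GraphQL mutation that enqueues a command and returns a
// correlation token, plus a subscription filtered by that token.
import amqp from 'amqplib';
import { randomUUID } from 'crypto';
import { PubSub, withFilter } from 'graphql-subscriptions';

const pubsub = new PubSub();

// One channel per API instance (RabbitMQ on localhost is assumed here).
const connection = await amqp.connect('amqp://localhost');
const channel = await connection.createChannel();
await channel.assertQueue('CommandQueue', { durable: true });

export const resolvers = {
  Mutation: {
    // Step 3: generate a token, enqueue the command, return the token.
    generateReport: async (_: unknown, args: { filters: unknown }) => {
      const token = randomUUID();
      channel.sendToQueue(
        'CommandQueue',
        Buffer.from(JSON.stringify({ type: 'GenerateReport', token, filters: args.filters })),
        { persistent: true },
      );
      return { token };
    },
  },
  Subscription: {
    // Steps 4 and 8: the client subscribes (or re-subscribes) with its token;
    // only events carrying the same token reach this client.
    reportGenerationDone: {
      subscribe: withFilter(
        () => pubsub.asyncIterator('REPORT_DONE'),
        (payload: { reportGenerationDone: { token: string } }, variables: { token: string }) =>
          payload.reportGenerationDone.token === variables.token,
      ),
    },
  },
};
```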
It is quite extensive, but with simple abstractions you do not have to think about all this complexity: a new process using this route takes roughly 30 lines of code across several services.
And what is brilliant about it: you end up with nice horizontal scaling, event replayability (retries), separation of concerns (client, API, workers), data pushed to the client as quickly as possible, and, as you mentioned, no bandwidth wasted on the "are we done yet?" polling requests.
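To give a feel for the "~30 lines across several services" claim, here is a rough sketch of the worker side, plus one way every API instance can receive the event. The fanout exchange (`EventsExchange`) is my assumption about the wiring; the real system only needs some mechanism that delivers the event to all instances:

```typescript
// Worker sketch: consume commands from the CommandQueue, do the slow work,
// then broadcast the result event to all API instances (assumed fanout wiring).
import amqp, { Channel } from 'amqplib';

declare function generateReport(filters: unknown): Promise<unknown>; // the actual 10-minute job

async function runWorker() {
  const connection = await amqp.connect('amqp://localhost');
  const channel = await connection.createChannel();
  await channel.assertQueue('CommandQueue', { durable: true });
  await channel.assertExchange('EventsExchange', 'fanout', { durable: true });
  await channel.prefetch(1); // one long-running report per worker at a time

  await channel.consume('CommandQueue', async (msg) => {
    if (!msg) return;
    const command = JSON.parse(msg.content.toString());
    const report = await generateReport(command.filters); // may take up to 10 minutes
    channel.publish(
      'EventsExchange',
      '',
      Buffer.from(JSON.stringify({ type: 'ReportGenerationDoneEvent', token: command.token, report })),
    );
    channel.ack(msg); // ack only after the event is out, so a crash leads to a retry
  });
}

// Runs in every API instance: bind a private queue to the fanout exchange and
// forward each event into the local GraphQL PubSub (see the previous sketch).
async function listenForEvents(channel: Channel, pubsub: { publish(t: string, p: unknown): void }) {
  const { queue } = await channel.assertQueue('', { exclusive: true });
  await channel.bindQueue(queue, 'EventsExchange', '');
  await channel.consume(queue, (msg) => {
    if (!msg) return;
    pubsub.publish('REPORT_DONE', { reportGenerationDone: JSON.parse(msg.content.toString()) });
    channel.ack(msg);
  });
}
```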
Another cool thing is that whenever the user opens the reports list in our panel, they see the reports that are currently being generated and can subscribe to their changes, so they do not have to refresh the list manually.
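On the client side, the reconnect-and-resubscribe part (steps 7-8 in the scenario) can look roughly like this with a library such as graphql-ws; the operation name and the two helper functions are made-up placeholders:

```typescript
// Client sketch: any transport that can re-subscribe with the saved token
// after a reconnect works the same way; graphql-ws is just one option.
import { createClient } from 'graphql-ws';

declare function requestReportGeneration(filters: unknown): Promise<string>; // hypothetical helper: sends the mutation, returns the token
declare function renderReport(report: unknown): void;                        // hypothetical helper: shows the finished report

const client = createClient({
  url: 'wss://api.example.com/graphql',
  retryAttempts: Infinity, // keep reconnecting if an API instance dies
});

// Steps 2-3: request the report and keep the token.
const token = await requestReportGeneration({ month: '2019-05' });

// Steps 4, 7-8: subscribe with the token; if the socket drops and the client
// reconnects to another instance, subscribing again with the same token is
// all that is needed to keep receiving the result.
client.subscribe(
  {
    query: `subscription ($token: ID!) {
      reportGenerationDone(token: $token) { reportId url }
    }`,
    variables: { token },
  },
  {
    next: ({ data }) => renderReport(data?.reportGenerationDone),
    error: (err) => console.error('subscription error', err),
    complete: () => console.log('report delivered'),
  },
);
```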
Good thinking on the SocketCluster. It would optimize step 10 in the above scenario, but for now we do not see any performance issues with broadcasting the ReportGenerationDoneEvent to the whole API cluster. With more instances or a multi-region architecture, it would be a must, as it would allow for better scaling and sharding.
It is important to understand that SocketCluster operates at the communication layer (WebSockets), while the logical API layer (GraphQL) sits above it. To make a GraphQL Subscription work, you just need a communication protocol that can push information to the user, and WebSockets allow that.
I think using SocketCluster is a good design choice, but remember to iterate on the implementation. Only use SocketCluster when you plan to have many sockets open at any single point in time. Also, subscribe only when necessary, because WebSockets are stateful and require management and heartbeats.
If you are further interested in the asynchronous backend architecture I described above, read up on the CQRS and Event Sourcing patterns.
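If it helps to see the shape of it, the messages in a setup like this tend to reduce to two small contracts. This is only an illustration of the CQRS split, with made-up fields for the report example, not code from the system above:

```typescript
// Commands express intent and are consumed exactly once by some worker.
interface GenerateReportCommand {
  type: 'GenerateReport';
  token: string;    // correlation token handed back to the client
  filters: unknown; // whatever the report needs
  issuedAt: string; // ISO timestamp
}

// Events state facts and are broadcast to every interested consumer.
interface ReportGenerationDoneEvent {
  type: 'ReportGenerationDoneEvent';
  token: string;    // same token, so API instances can match subscribers
  reportId: string;
  finishedAt: string;
}

// With Event Sourcing, events like the one above are also appended to a log,
// so state can be rebuilt and handlers can be replayed (retries for free).
```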