Q: "Is that possible?"
Let's make a sketch of a single-user, single-transaction, end-to-end latency budget composition:
1. User may spend from about 1 [ms] if colocated, up to 150+ [ms], for sending a packet over the live, already-established connection ( here we ignore all socket initiation & setup negotiations for simplicity )
2. Server may spend anything from 25+ [ms] for "reading" the auth'd-user specific, JSON-formatted string from RAM, upon the first seek/index into that still-string, SER/DES-ed representation of the key:value pairs ( here we ignore, for simplicity, all add-on costs of non-exclusive use of the NUMA ecosystem, spent on actually finding, physically reading and cross-NUMA transporting those 60 ~ 100 MB of auth'd-user specific data from a remote, roughly TB-sized off-RAM storage into their final destination inside a local CPU-core RAM area )
3. JSON-decoder may spend any amount of additional time on repetitive key:value tests over the 60 ~ 100 MB data dictionary
4. ML-model may spend any amount of additional time on the .predict()-method's internal evaluation
5. Server will spend some additional time on assembling a reply to the user
6. Network will again add transport latency, principally similar to that experienced under item 1 above
7. Server will next spend some additional time on a per-user & per-incident specific modification of the in-RAM, per-user maintained, JSON-encoded 60 ~ 100 MB data dictionary ( this part ought always to happen after the items above, if UX latency was a design priority )
8. Server will next spend some additional time on the opposite direction of cross-NUMA ecosystem data transport & storage. While mirroring item 2, this time the data-flow may enjoy non-critical / async / cached / latency-masked, deferred usage of the physical resources, which was not the case under item 2, where no pre-caching will happen unless some TB-sized, exclusive-use, never-evicted cache footprints are present and reserved end-to-end along the whole data transport trajectory, from the local CPU-core in-RAM representation, through re-SER-ialisation into a string, over all the cross-NUMA ecosystem interconnects, down to the very last cold-storage physical device ( which is almost sure not to happen here )
( subtotal ... [ms] for a single-user single-transaction single-prediction )
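As a rough cross-check of that subtotal, here is a minimal back-of-envelope sketch in Python. The payload shape, the key count and every fixed [ms] figure below are my own illustrative assumptions, not measurements of the O/P's system; the only measured number is the json.loads() cost on a synthetic ~60-80 MB string, and that one alone already exceeds a few tens of [ms] on commodity hardware.

```python
# Back-of-envelope sketch only. Payload shape and all fixed [ms] figures below
# are assumptions for illustration; re-measure on the target hardware.
import json
import time

# Synthetic stand-in for the per-user, 60 ~ 100 MB JSON data dictionary
N_KEYS = 600_000                      # hypothetical key count
payload = {f"key_{i}": {"feature": i, "history": [i] * 10} for i in range(N_KEYS)}
json_text = json.dumps(payload)
print(f"payload size      ~ {len(json_text) / 1e6:6.1f} MB")

t0 = time.perf_counter()
_ = json.loads(json_text)             # item 3: repetitive key:value decoding
t_decode_ms = (time.perf_counter() - t0) * 1e3
print(f"json.loads() cost ~ {t_decode_ms:6.0f} ms")

# Illustrative single-user, single-transaction budget [ms] (assumed, not measured)
budget_ms = {
    "1 network user -> server (colocated)":  1.0,          # 150+ over a WAN
    "2 fetch 60-100 MB into local RAM":     25.0,
    "3 JSON decode":                        t_decode_ms,
    "4 ML .predict()":                      10.0,           # model-dependent guess
    "5 assemble reply":                      1.0,
    "6 network server -> user":              1.0,
    "7 update in-RAM JSON dict":             5.0,
    "8 re-SER-ialise & write back":         t_decode_ms,    # rough symmetry guess
}
print(f"subtotal          ~ {sum(budget_ms.values()):6.0f} ms  (single user, zero contention)")
```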
Let's make a sketch of what else goes wrong once the many-users many-transactions reality enters the ZOO:
a. All the so far optimistic ( assumed exclusive ) resources will start to degrade in processing performance / transport throughput, which will add to and/or increase the actually achieved latencies, because concurrent requests now enter blocking states ( both on the micro-level, like CPU-core LRU-cache resupply delays and cross-QPI extended access times to CPU-core non-local RAM areas, and on the macro-level, like enqueuing all calls to wait until the local ML-model .predict()-method is free to run ), none of which were present in the non-shared, single-user single-transaction case with unlimited exclusive resource usage above, so never expect a fair split of resources
b. Everything that was "permissive" for a deferred ( ALAP ) write in items 7 & 8 above will now become part of the end-to-end latency critical path, as the JSON-encoded 60 ~ 100 MB data write-back also has to be completed ASAP, not ALAP, since one never knows how soon another request from the same user will arrive, and any next shot has to re-fetch the already updated JSON-data ( perhaps even some user-specific serialisation of the sequence of requests will have to be implemented, so as to avoid losing the mandatory order of self-evolution of this very same user-specific JSON-data's sequential self-updates )
( subtotal for about 10k+ many-users many-transactions many-predictions will IMHO hardly remain inside a few tens of [ms] )
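To put item a. into numbers, here is a toy M/M/1 queueing sketch. The ~50 [ms] serialised critical section per request and the per-user request rates are my own assumptions, not figures from the O/P:

```python
# Toy M/M/1 sketch. The 50 ms serialised critical section per request and the
# per-user request rates are illustrative assumptions, not measured values.

service_ms = 50.0                              # assumed serialised work per request
mu = 1000.0 / service_ms                       # service rate [req/s] -> 20 req/s

for users in (100, 1_000, 10_000):
    lam = users * 1.0 / 60.0                   # each user sends ~1 request per minute
    rho = lam / mu                             # utilisation of the shared resource
    if rho >= 1.0:
        print(f"{users:>6} users: rho = {rho:5.2f} -> queue grows without bound")
    else:
        w_ms = 1000.0 / (mu - lam)             # mean time in system [ms], M/M/1
        print(f"{users:>6} users: rho = {rho:5.2f} -> mean latency ~ {w_ms:4.0f} ms")
```

Under these assumptions, at ~80 % utilisation the mean latency is already about six times the service time, and past rho ~ 1 it does not degrade gracefully, it diverges, which is exactly the effect described under item a.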
Architecture?
Well, given the O/P-sketched computation strategy, there seems to be no architecture that would "save" us from all the principal inefficiencies requested there.
For industry segments where ultra-low-latency designs are a must, the core design principle is to avoid any unnecessary source of added end-to-end latency.
binary-compact BLOBs rule ( JSON-strings are hellishly expensive at every stage, from storage, through all network transport flows, to the repetitive SER-/DES-erialisation re-processing; see the comparison sketch after these two points )
poor in-RAM computing scaling makes big designs move ML-models closer to the ecosystem periphery, not keep them as a singular CPU/RAM-blocking, cache-depleting element inside the core of the NUMA ecosystem
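As a hedged illustration of the BLOB point, here is a stdlib-only sketch that round-trips the same synthetic one-million-element numeric vector through JSON text and through a packed float64 buffer. The exact size and time ratios will vary with the data and the interpreter; the ordering is what matters here.

```python
# Minimal sketch of "binary-compact BLOBs rule": the same numeric payload as
# JSON text vs. a packed float64 buffer. Synthetic data, illustrative only.
import array
import json
import time

values = [i / 3.0 for i in range(1_000_000)]         # hypothetical feature vector

t0 = time.perf_counter()
text = json.dumps(values)                             # text SER
decoded = json.loads(text)                            # text DES
t_json_ms = (time.perf_counter() - t0) * 1e3

t0 = time.perf_counter()
blob = array.array("d", values).tobytes()             # binary SER (float64)
restored = array.array("d")
restored.frombytes(blob)                              # binary DES
t_blob_ms = (time.perf_counter() - t0) * 1e3

print(f"JSON text  : {len(text) / 1e6:6.1f} MB, round-trip ~ {t_json_ms:8.1f} ms")
print(f"binary BLOB: {len(blob) / 1e6:6.1f} MB, round-trip ~ {t_blob_ms:8.1f} ms")
```

The same argument applies, a fortiori, to the 60 ~ 100 MB per-user dictionary and to every network hop it has to cross.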
( Does it seem complex? Yeah, it is complex & heterogeneous; distributed computing for (ultra)low-latency is a technically hard domain, not a free choice of some "golden bullet" architecture )