Q: "Is that possible?"
Let's make a sketch of a single-user, single-transaction, end-to-end latency budget composition:
1. User may spend from about 1 [ms] if colocated, up to 150+ [ms], for sending a packet over the live, already-established connection ( here we ignore all socket initiation & setup negotiations for simplicity )
2. Server may spend anything from 25+ [ms] for "reading" the auth'd-user specific, JSON-formatted string from RAM, upon the first seek/index into that still-string, SER/DES-ed representation of the key:value pairs ( here we ignore, for simplicity, all add-on costs of non-exclusive use of the NUMA ecosystem, spent on actually finding, physically reading and cross-NUMA transporting those 60 ~ 100 MB of auth'd-user specific data from a remote, roughly TB-sized off-RAM storage into their final destination inside a local CPU-core RAM area )
3. JSON-decoder may spend any amount of additional time on repetitive key:value tests over the 60 ~ 100 MB data dictionary
4. ML-model may spend any amount of additional time on the .predict()-method's internal evaluation
5. Server will spend some additional time on assembling a reply to the user
6. Network will again add transport latency, principally similar to that experienced under item 1 above
7. Server will next spend some additional time on a per-user & per-incident specific modification of the in-RAM, per-user maintained, JSON-encoded 60 ~ 100 MB data dictionary ( this part ought always to happen after the items above, if UX latency was a design priority )
8. Server will next spend some additional time on the opposite direction of cross-NUMA ecosystem data transport & storage. While mirroring item 2, this time the data-flow may enjoy non-critical / async / cached / latency-masked, deferred usage of the physical resources, which was not the case under item 2, where no pre-caching will happen unless some TB-sized, exclusive-use, never-evicted cache footprints are present and reserved end-to-end along the whole data transport trajectory, from the local CPU-core in-RAM representation, through re-SER-ialisation into a string, over all the cross-NUMA ecosystem interconnects, down to the very last cold-storage physical device ( which is almost sure not to happen here )
( subtotal ... [ms] for a single-user single-transaction single-prediction )
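As a rough cross-check of that subtotal, here is a minimal back-of-envelope sketch in Python. The payload shape, the key count and every fixed [ms] figure below are my own illustrative assumptions, not measurements of the O/P's system; the only measured number is the json.loads() cost on a synthetic ~60-80 MB string, and that one alone already exceeds a few tens of [ms] on commodity hardware.

```python
# Back-of-envelope sketch only. Payload shape and all fixed [ms] figures below
# are assumptions for illustration; re-measure on the target hardware.
import json
import time

# Synthetic stand-in for the per-user, 60 ~ 100 MB JSON data dictionary
N_KEYS = 600_000                      # hypothetical key count
payload = {f"key_{i}": {"feature": i, "history": [i] * 10} for i in range(N_KEYS)}
json_text = json.dumps(payload)
print(f"payload size      ~ {len(json_text) / 1e6:6.1f} MB")

t0 = time.perf_counter()
_ = json.loads(json_text)             # item 3: repetitive key:value decoding
t_decode_ms = (time.perf_counter() - t0) * 1e3
print(f"json.loads() cost ~ {t_decode_ms:6.0f} ms")

# Illustrative single-user, single-transaction budget [ms] (assumed, not measured)
budget_ms = {
    "1 network user -> server (colocated)":  1.0,          # 150+ over a WAN
    "2 fetch 60-100 MB into local RAM":     25.0,
    "3 JSON decode":                        t_decode_ms,
    "4 ML .predict()":                      10.0,           # model-dependent guess
    "5 assemble reply":                      1.0,
    "6 network server -> user":              1.0,
    "7 update in-RAM JSON dict":             5.0,
    "8 re-SER-ialise & write back":         t_decode_ms,    # rough symmetry guess
}
print(f"subtotal          ~ {sum(budget_ms.values()):6.0f} ms  (single user, zero contention)")
```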
Let's make a sketch of what else goes wrong once the many-users many-transactions reality enters the ZOO:
a. All the so far optimistic ( assumed exclusive ) resources will start to degrade in processing performance / transport throughput, which will add to and/or increase the actually achieved latencies, because concurrent requests now enter blocking states ( both on the micro-level, like CPU-core LRU-cache resupply delays and cross-QPI extended access times to CPU-core non-local RAM areas, and on the macro-level, like enqueuing all calls to wait until the local ML-model .predict()-method is free to run ), none of which were present in the non-shared, single-user single-transaction case with unlimited exclusive resource usage above, so never expect a fair split of resources
b. Everything that was "permissive" for a deferred ( ALAP ) write in items 7 & 8 above will now become part of the end-to-end latency critical path, as the JSON-encoded 60 ~ 100 MB data write-back also has to be completed ASAP, not ALAP, since one never knows how soon another request from the same user will arrive, and any next shot has to re-fetch the already updated JSON-data ( perhaps even some user-specific serialisation of the sequence of requests will have to be implemented, so as to avoid losing the mandatory order of self-evolution of this very same user-specific JSON-data's sequential self-updates )
( subtotal for about 10k+ many-users many-transactions many-predictions will IMHO hardly remain inside a few tens of [ms] )
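To put item a. into numbers, here is a toy M/M/1 queueing sketch. The ~50 [ms] serialised critical section per request and the per-user request rates are my own assumptions, not figures from the O/P:

```python
# Toy M/M/1 sketch. The 50 ms serialised critical section per request and the
# per-user request rates are illustrative assumptions, not measured values.

service_ms = 50.0                              # assumed serialised work per request
mu = 1000.0 / service_ms                       # service rate [req/s] -> 20 req/s

for users in (100, 1_000, 10_000):
    lam = users * 1.0 / 60.0                   # each user sends ~1 request per minute
    rho = lam / mu                             # utilisation of the shared resource
    if rho >= 1.0:
        print(f"{users:>6} users: rho = {rho:5.2f} -> queue grows without bound")
    else:
        w_ms = 1000.0 / (mu - lam)             # mean time in system [ms], M/M/1
        print(f"{users:>6} users: rho = {rho:5.2f} -> mean latency ~ {w_ms:4.0f} ms")
```

Under these assumptions, at ~80 % utilisation the mean latency is already about six times the service time, and past rho ~ 1 it does not degrade gracefully, it diverges, which is exactly the effect described under item a.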
Architecture?
Well, given the O/P-sketched computation strategy, there seems to be no architecture that would "save" us from all the principal inefficiencies requested there.
For industry segments where ultra-low-latency designs are a must, the core design principle is to avoid any unnecessary source of added end-to-end latency.
binary-compact BLOBs rule ( JSON-strings are hellishly expensive at every stage, from storage, through all network transport flows, to the repetitive SER-/DES-erialisation re-processing; see the comparison sketch after these two points )
poor in-RAM computing scaling makes big designs move ML-models closer to the ecosystem periphery, not keep them as a singular CPU/RAM-blocking, cache-depleting element inside the core of the NUMA ecosystem
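As a hedged illustration of the BLOB point, here is a stdlib-only sketch that round-trips the same synthetic one-million-element numeric vector through JSON text and through a packed float64 buffer. The exact size and time ratios will vary with the data and the interpreter; the ordering is what matters here.

```python
# Minimal sketch of "binary-compact BLOBs rule": the same numeric payload as
# JSON text vs. a packed float64 buffer. Synthetic data, illustrative only.
import array
import json
import time

values = [i / 3.0 for i in range(1_000_000)]         # hypothetical feature vector

t0 = time.perf_counter()
text = json.dumps(values)                             # text SER
decoded = json.loads(text)                            # text DES
t_json_ms = (time.perf_counter() - t0) * 1e3

t0 = time.perf_counter()
blob = array.array("d", values).tobytes()             # binary SER (float64)
restored = array.array("d")
restored.frombytes(blob)                              # binary DES
t_blob_ms = (time.perf_counter() - t0) * 1e3

print(f"JSON text  : {len(text) / 1e6:6.1f} MB, round-trip ~ {t_json_ms:8.1f} ms")
print(f"binary BLOB: {len(blob) / 1e6:6.1f} MB, round-trip ~ {t_blob_ms:8.1f} ms")
```

The same argument applies, a fortiori, to the 60 ~ 100 MB per-user dictionary and to every network hop it has to cross.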
( Does it seem complex? Yeah, it is complex & heterogeneous; distributed computing for (ultra)low-latency is a technically hard domain, not a free choice of some "golden bullet" architecture )