Summary:
For a dropdown I need to compute the distinct values of a specific key stored inside a JSONField column of a table in a Postgres database. Worst case: the table contains 1-10 million entries.
Background:
I'm developing a setup with multiple deployments (one per customer). Each deployment contains a backend and multiple clients, and logs are posted continuously from the clients to the backend. Each log entry contains a field `log_meta`, which includes a key named `origin` describing where the entry came from. The value of `origin` may vary from customer to customer, and I don't want to enforce a restricted set of values; in general it denotes the environment the client is running in, so "DEV" and "PRODUCTION" are typical values. In practice, there might be only 1-2 distinct values for `origin` in a single deployment, while the number of logs can be expected to be in the range of 1-10 million.
```python
from django.db import models
from jsonfield import JSONField


class Log(models.Model):
    # Other fields ...
    log_json = JSONField(default=list)
    log_meta = JSONField(default=dict)  # Contains a key named "origin"
```
In an "administrator frontend" I want the administrator to be able to filter (via a dropdown) to see only logs coming from a specific origin. To do so, I need to extract the distinct values of the `origin` key.
How can I compute this set of distinct values in Django, taking into account that the number of logs may in some cases be in the 1-10 million range?
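For reference, the raw SQL I believe I need is roughly `SELECT DISTINCT log_meta->>'origin' FROM app_log;`, which I assume maps to a Django key transform plus `.distinct()`, e.g. `Log.objects.values_list("log_meta__origin", flat=True).distinct()` (untested). A self-contained sketch of the same semantics, using the stdlib `sqlite3` module's `json_extract` as a stand-in for Postgres's `->>` operator:

```python
import json
import sqlite3

# In-memory stand-in for the Postgres table; json_extract(log_meta, '$.origin')
# plays the role of Postgres's log_meta->>'origin'.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (log_meta TEXT)")
rows = [{"origin": "DEV"}, {"origin": "PRODUCTION"}, {"origin": "DEV"}]
conn.executemany("INSERT INTO log VALUES (?)", [(json.dumps(m),) for m in rows])

# DISTINCT over the extracted key yields the candidate dropdown values.
distinct_origins = {
    r[0]
    for r in conn.execute(
        "SELECT DISTINCT json_extract(log_meta, '$.origin') FROM log"
    )
}
print(sorted(distinct_origins))  # ['DEV', 'PRODUCTION']
```

My worry is that without an index on the extracted key, this scans the whole table on every dropdown load.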
What I've tried already:
- Nothing, since I don't know how it can be done.
Additional info:
- The backend is written in Django, using Postgres as database.
- If it is not feasible to compute the values on the fly, my alternative is to build the set of distinct values continuously as the logs roll in. I consider this a second-best option, since it introduces additional state; if possible, I'd prefer to just compute/derive the set instead.
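For completeness, the stateful alternative I have in mind would look roughly like this (in Django it would presumably be a small lookup table with a unique constraint, updated via `get_or_create()` at ingestion time; the in-memory set below is just a sketch of the idea):

```python
# Sketch of the alternative: maintain the distinct set as logs arrive,
# rather than deriving it with a query over millions of rows.
known_origins: set[str] = set()


def ingest_log(log_meta: dict) -> None:
    """Record a log's origin; mirrors an upsert into a tiny lookup table."""
    origin = log_meta.get("origin")
    if origin is not None:
        known_origins.add(origin)


for meta in [{"origin": "DEV"}, {"origin": "DEV"}, {"origin": "PRODUCTION"}]:
    ingest_log(meta)
print(sorted(known_origins))  # ['DEV', 'PRODUCTION']
```

The downside is exactly the extra state: the lookup table can drift from the log table (e.g. if logs are deleted), which is why I'd prefer a pure query if it can be made fast enough.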