I have a very large Postgres table with millions of rows. One of the columns is called data
and is of type JSONB
with nested JSON (but thankfully no sub-arrays). The "schema" of the JSON is mostly consistent, but it has evolved a bit over time, gaining and losing keys (including nested ones).
I'd like a process, as simple as possible, for normalizing the column into a new table.
For example, if the table looked like:
id | data
---+---------------------------
 1 | {"hi": "mom", "age": 43}
 2 | {"bye": "dad", "age": 41}
it should create and populate a new table such as
id | data.hi | data.age | data.bye
---+---------+----------+----------
 1 | mom     | 43       | NULL
 2 | NULL    | 41       | dad
(Note that the exact column names aren't crucial.)
In theory, I could do the following:
- Select the column into a Pandas DataFrame and run json_normalize on it
- Infer the schema as the superset of the columns derived in step 1
- Create a Postgres table with the schema from step 2 and insert into it (to_sql is an easy way to achieve this); see the sketch below
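Concretely, the single-DataFrame version might look roughly like this (the connection string and the source_table/data_flat names are just placeholders):

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string and table names.
engine = create_engine("postgresql://user:pass@localhost/mydb")

# Step 1: pull the JSONB column and flatten it; nested keys become "a.b" columns.
df = pd.read_sql("SELECT id, data FROM source_table", engine)
flat = pd.json_normalize(df["data"].tolist())
flat.insert(0, "id", df["id"].values)

# Steps 2 and 3: the inferred schema is simply the superset of columns that
# json_normalize produced, and to_sql creates and populates the new table from it.
flat.to_sql("data_flat", engine, if_exists="replace", index=False)
```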
This doesn't seem too bad, but recall that the table is very large, so we should assume it can't be handled in a single DataFrame. If we try the next best thing, which is to batch the above steps, we run into the problem that the schema changes slightly between batches.
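A batched variant is easy enough to write, but it makes the schema-drift issue concrete: each chunk can flatten to a different column set, so appending to a fixed target table breaks as soon as a batch introduces a key the table hasn't seen (again, names and chunk size are illustrative):

```python
# Batched version of the same idea, using the engine from the sketch above.
for chunk in pd.read_sql("SELECT id, data FROM source_table", engine, chunksize=50_000):
    flat = pd.json_normalize(chunk["data"].tolist())
    flat.insert(0, "id", chunk["id"].values)
    # Each batch may produce a different set of flattened columns...
    print(sorted(flat.columns))
    # ...so a plain append fails once a batch contains a column the target table lacks:
    # flat.to_sql("data_flat", engine, if_exists="append", index=False)
```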
Is there a better way to solve this problem than my approach? A "perfect" solution would be "pure SQL" and not involve any Python at all, but I'm not looking for perfection here, just an automatic and robust process that doesn't require human intervention.