I have a very large Postgres table with millions of rows. One of the columns is called data
and is of type JSONB
with nested JSON (but thankfully no sub-arrays). The "schema" of the JSON is mostly consistent, but it has evolved a bit over time, gaining and losing keys (including nested ones).
I'd like a process, as simple as possible, for normalizing the column into a new table.
For example, if the table looked like:
id | data
---+---------------------------
 1 | {"hi": "mom", "age": 43}
 2 | {"bye": "dad", "age": 41}
it should create and populate a new table such as
id | data.hi | data.age | data.bye
---+---------+----------+----------
 1 | mom     | 43       | NULL
 2 | NULL    | 41       | dad
(Note that the exact column names aren't crucial.)
In theory, I could do the following:
- Select the column into a Pandas DataFrame and run json_normalize on it
- Infer the schema as the superset of the columns derived in step 1
- Create a Postgres table with the schema from step 2 and insert into it (to_sql is an easy way to achieve this); see the sketch below
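Concretely, the single-DataFrame version might look roughly like this (the connection string and the source_table/data_flat names are just placeholders):

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string and table names.
engine = create_engine("postgresql://user:pass@localhost/mydb")

# Step 1: pull the JSONB column and flatten it; nested keys become "a.b" columns.
df = pd.read_sql("SELECT id, data FROM source_table", engine)
flat = pd.json_normalize(df["data"].tolist())
flat.insert(0, "id", df["id"].values)

# Steps 2 and 3: the inferred schema is simply the superset of columns that
# json_normalize produced, and to_sql creates and populates the new table from it.
flat.to_sql("data_flat", engine, if_exists="replace", index=False)
```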
This doesn't seem too bad, but recall that the table is very large, so we should assume it can't be handled in a single DataFrame. If we try the next best thing, which is to batch the above steps, we run into the problem that the schema changes slightly between batches.
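A batched variant is easy enough to write, but it makes the schema-drift issue concrete: each chunk can flatten to a different column set, so appending to a fixed target table breaks as soon as a batch introduces a key the table hasn't seen (again, names and chunk size are illustrative):

```python
# Batched version of the same idea, using the engine from the sketch above.
for chunk in pd.read_sql("SELECT id, data FROM source_table", engine, chunksize=50_000):
    flat = pd.json_normalize(chunk["data"].tolist())
    flat.insert(0, "id", chunk["id"].values)
    # Each batch may produce a different set of flattened columns...
    print(sorted(flat.columns))
    # ...so a plain append fails once a batch contains a column the target table lacks:
    # flat.to_sql("data_flat", engine, if_exists="append", index=False)
```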
Is there a better way to solve this problem than my approach? A "perfect" solution would be "pure SQL" and not involve any Python at all, but I'm not looking for perfection here, just an automatic and robust process that doesn't require human intervention.