
I have a very large Postgres table with millions of rows. One of the columns, called data, is of type JSONB and holds nested JSON (but thankfully no sub-arrays). The "schema" of the JSON is mostly consistent, but it has evolved a bit over time, gaining and losing keys and nested keys.

I'd like a process, as simple as possible, by which I can normalize this column into a new table.

For example, if the table looked like:

id | data
---+--------------------------
 1 | {"hi": "mom", "age": 43}
 2 | {"bye": "dad", "age": 41}

it should create and populate a new table such as

id | data.hi | data.age | data.bye
---+---------+----------+---------
 1 | mom     |       43 | NULL
 2 | NULL    |       41 | dad

(Note that the exact column names aren't crucial.)

In theory, I could do the following:

  1. Select the column into a Pandas DataFrame and run a json_normalize on it
  2. Infer the schema as the superset of the derived columns in step 1
  3. Create a Postgres table with the schema of step 2 and insert (to_sql is an easy way to achieve this)

This doesn't seem too bad, but recall that the table is very large, so we should assume this cannot be done with a single DataFrame. If we try the next best thing, batching the above steps, we run into the problem that the schema changes slightly between batches.

Is there a better way to solve this problem than my approach? A "perfect" solution would be "pure SQL" and not involve any Python at all. But I'm not looking for perfection here, just an automatic and robust process that doesn't require human intervention.

1 Answer


You can try to create a new table via the CREATE TABLE AS statement.

CREATE TABLE newtable AS
SELECT 
  id, 
  (data->>'hi')::text AS data_hi,
  (data->>'bye')::text AS data_bye,
  (data->>'age')::int AS data_age
FROM mytable;

If the JSON structure is unknown, all keys and data types can be selected like this:

SELECT DISTINCT
  jsonb_object_keys(data) as col_name,
  jsonb_typeof(data->jsonb_object_keys(data)) as col_type
FROM mytable;

Output:

col_name    col_type
--------------------
bye         string
hi          string
age         number

For a nested structure

id   data
---------
3    {"age": 33, "foo": {"bar": true}}

you can use a recursive query:

WITH RECURSIVE cte AS (
  select
    jsonb_object_keys(data) as col_name,
    jsonb_object_keys(data) as col_path,
    jsonb_typeof(data->jsonb_object_keys(data)) as col_type,
    data
  from mytable
  union all
  select
    jsonb_object_keys(data->col_name) as col_name,
    col_path || '_' || jsonb_object_keys(data->col_name) as col_path, 
    jsonb_typeof(data->col_name->jsonb_object_keys(data->col_name)) as col_type,
    data->cte.col_name AS data
  from cte
  where col_type = 'object'
) 
SELECT distinct col_path AS col_name, col_type 
FROM cte
WHERE col_type <> 'object';

Output:

col_name    col_type
--------------------
age         number
foo_bar     boolean

Next, you need to build a list of columns for the SELECT clause based on this data for use in the CREATE TABLE AS statement, as shown above.
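
That assembly step can also be done in SQL itself. Below is a minimal sketch for the flat (top-level) case, reusing the mytable/newtable names from above: it maps the jsonb_typeof() output to Postgres types, builds the column expressions with format() and string_agg(), and runs the generated CREATE TABLE AS via EXECUTE. Nested paths from the recursive query would need path accessors (e.g. data #>> '{foo,bar}') instead of ->>.

DO $$
DECLARE
  col_list text;
BEGIN
  -- Collect the top-level keys and their JSON types (same query as above)
  -- and turn each one into a "(data->>'key')::type AS data_key" expression.
  SELECT string_agg(
           format('(data->>%L)::%s AS %I',
                  col_name,
                  CASE col_type
                    WHEN 'number'  THEN 'numeric'
                    WHEN 'boolean' THEN 'boolean'
                    ELSE 'text'
                  END,
                  'data_' || col_name),
           ', ')
    INTO col_list
  FROM (
    SELECT DISTINCT
      jsonb_object_keys(data) AS col_name,
      jsonb_typeof(data -> jsonb_object_keys(data)) AS col_type
    FROM mytable
  ) AS keys;

  -- Build and run the CREATE TABLE AS statement with the generated column list.
  EXECUTE format('CREATE TABLE newtable AS SELECT id, %s FROM mytable', col_list);
END $$;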

The following fiddle has a helper that generates the entire SQL:

db<>fiddle

Note that jsonb_typeof() reports all numeric values, including fractional ones, simply as number, so such columns may need a type correction afterwards.
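
For example, a column created as numeric that actually holds only integers could be narrowed after the fact (newtable and data_age are the names assumed from the example above):

-- Narrow a numeric column to int once you know it only holds integers.
ALTER TABLE newtable ALTER COLUMN data_age TYPE int USING data_age::int;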

id'7238
  • Thanks id'7238 for your answer. It shows how `CREATE TABLE AS ... SELECT` can be used to populate the normalized table from the JSON column. However, in order to write the create statement, we needed to know the schema. I would like to have the SQL figure this out automatically. Perhaps [jsonb_populate_record](https://www.postgresql.org/docs/13/functions-json.html#id-1.5.8.22.5.11.2.2.9.1.2.1) might be useful in this regard. – Zephaniah Grunschlag Sep 11 '21 at 04:10
  • Even if we are able to tackle the above problem, `jsonb_populate_record` only looks at the top level of the JSON. In order to unwind the sub-jsons we would somehow need a recursive version. – Zephaniah Grunschlag Sep 11 '21 at 04:14
  • If the JSON data structure is unknown, then you can try to get all the names of the keys and their data type (see [db<>fiddle](https://dbfiddle.uk/?rdbms=postgres_13&fiddle=7c1cc3c70f7cfb4eb8960f00ca52a9e0)). But you will have to manually describe the list of columns in the SELECT clause based on this data and then execute the CREATE TABLE AS query, as described above. I don’t know how useful this will be in your case. – id'7238 Sep 11 '21 at 08:02
  • I’ll stew over that. Pretty spectacular SQL in that DB Fiddle! – Zephaniah Grunschlag Sep 11 '21 at 11:07