2

I have nearly 50 json files. I want to create Postgres database based on these files. Each file contains data of one table. The files are not very large (at maximum several thousand records). Example data from customers.json (in fact there are more fields, I have simplified it):

[ 
    {
    "Id": 55948,
    "FullName": "Full name #1",
    "Address": "Address #1",
    "Turnover": 120400.5,
    "DateOfRegistration": "2014-02-13",
    "LastModifiedAt": "2015-11-03 12:04:44" },
    {
    "Id": 55949,
    "FullName": "Full name %2",
    "Address": "Address #2",
    "Turnover": 120000.0,
    "DateOfRegistration": "2012-12-01",
    "LastModifiedAt": "2015-11-04 17:14:21" }
]

I try to write a function which creates a table and inserts all the data to it. My attempt is based on dynamic query using EXECUTE:

CREATE OR REPLACE FUNCTION import_json(table_name text, data json)
RETURNS VOID AS $$
DECLARE
    query text;
    colname text;
BEGIN
    query := 'CREATE TABLE ' || table_name || ' (';
    FOR colname IN SELECT json_object_keys(data->0)
    LOOP query := query || lower(colname) || ' text,';
    END LOOP;
    query := rtrim(query, ',') || ');';
    EXECUTE(query);
END $$ LANGUAGE plpgsql;

My function creates a table with expected column names, but all columns are of type text. The problem is that I do not know how to define proper types of columns.

The json files are formatted well and contain integer, numeric, date, timestamp and text values. I would like to get the table:

CREATE TABLE customers (
  id integer, 
  fullname text, 
  address text, 
  turnover numeric, 
  date_of_registration date, 
  last_modified_at timestamp);

The main question: how can I recognize types of columns in generated table?

Additionally, is there an easy way to transform Pascal to underscore notation ("DateOfRegistration" -> "date_of_registration")?

thor
  • 21,418
  • 31
  • 87
  • 173
markasz
  • 125
  • 1
  • 6
  • are you opposed to use just plpgsql or would external languages be ok. Also do you have the schema for the table where you define what column types should be, or are you trying to infer it from the field name, IE fieldname starts with "date" set to date, field name is id, then integer? – Doon Nov 16 '15 at 01:54
  • 1
    Semi-related tip: Use `format` with `%I` instead of using `||` to concatenate in dynamic SQL. Personally I'd do this job with Python or something, anyway. – Craig Ringer Nov 16 '15 at 01:55
  • if you're open to using python, then this is doable without too much trouble. Hint for your main question: [Use a json schema generator](https://github.com/perenecabuto/json_schema_generator) to figure out the types of the target table., [SO thread that answers the 2nd question](http://stackoverflow.com/questions/1175208/elegant-python-function-to-convert-camelcase-to-camel-case) – Haleemur Ali Nov 16 '15 at 02:13

1 Answers1

4

You can determine the type of a column by examining a value. The function below formats column's definition from a pair (key, value). It uses regex pattern matching. It also transforms column's name to the notation with underscores (using regexp_replace() function). Of course, the function wont work properly if the value represents NULL, so you have to check that the first json record has all not-null values.

create or replace function format_column(ckey text, cval text)
returns text language sql immutable as $$
    select format('%s %s',
        lower(regexp_replace(ckey, '(.)([A-Z])', '\1_\2', 'g')),
        case 
            when cval ~ '^[\+-]{0,1}\d+$' then 'integer'
            when cval ~ '^[\+-]{0,1}\d*\.\d+$' then 'numeric'
            when cval ~ '^"\d\d\d\d-\d\d-\d\d"$' then 'date'
            when cval ~ '^"\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d"$' then 'timestamp'
            else 'text' 
        end
    )
$$;

select format_column(key, value)
from (
    values 
        ('Id', '55948'),
        ('FullName', '"Full name #1"'),
        ('Turnover', '120400.5'),
        ('DateOfRegistration', '"2014-02-13"')
    ) val(key, value);

       format_column       
---------------------------
 id integer
 full_name text
 turnover numeric
 date_of_registration date
(4 rows)        

In the main function you do not need variables or loops. Use format() function to format strings with parameters and string_agg() to create textual lists. Since you need both keys and values, use json_each() instead of json_object_keys(). In the second query you can use row_number() to ensure that the aggregated list of values is divided for consecutive records.

create or replace function import_table(table_name text, jdata json)
returns void language plpgsql as $$
begin
    execute format('create table %s (%s)', table_name, string_agg(col, ', '))
    from (
        select format_column(key::text, value::text) col
        from json_each(jdata->0)
        ) sub;

    execute format('insert into %s values %s', table_name, string_agg(val, ','))
    from (
        with lines as (
            select row_number() over () rn, line
            from (
                select json_array_elements(jdata) line
                ) sub
            )
        select rn, format('(%s)', string_agg(value, ',')) val
        from (
            select rn, format('%L', trim(value::text, '"')) as value
            from lines, json_each(line)
            ) sub
        group by 1
        ) sub;
end $$; 

Test:

select import_table('customers', 
    '[{ "Id": 55948,
        "FullName": "Full name #1",
        "Address": "Address #1",
        "Turnover": 120400.5,
        "DateOfRegistration": "2014-02-13",
        "LastModifiedAt": "2015-11-03 12:04:44" },
    {   "Id": 55949,
        "FullName": "Full name %2",
        "Address": "Address #2",
        "Turnover": 120000.0,
        "DateOfRegistration": "2012-12-01",
        "LastModifiedAt": "2015-11-04 17:14:21" }]');

\d customers
                    Table "public.customers"
        Column        |            Type             | Modifiers 
----------------------+-----------------------------+-----------
 id                   | integer                     | 
 full_name            | text                        | 
 address              | text                        | 
 turnover             | numeric                     | 
 date_of_registration | date                        | 
 last_modified_at     | timestamp without time zone |

select * from customers;

  id   |  full_name   |  address   | turnover | date_of_registration |  last_modified_at   
-------+--------------+------------+----------+----------------------+---------------------
 55948 | Full name #1 | Address #1 | 120400.5 | 2014-02-13           | 2015-11-03 12:04:44
 55949 | Full name %2 | Address #2 | 120000.0 | 2012-12-01           | 2015-11-04 17:14:21
(2 rows)    
klin
  • 112,967
  • 15
  • 204
  • 232