
Problem

I am looking at some 2-5 million events of a time series, batch-queried daily from a REST API and fed into rows of a PostgreSQL database. I use an RDBMS because I want the time series pre-structured for faster analysis and exploration later on, and the input schema does not really change. So Cassandra or MongoDB are not an option, although they would happily accept the JSON as is.

The database is running on cloud infrastructure and PostgreSQL is accessible over the network. The database itself may be residing on a network-attached file system, so I/O suffers from the corresponding latency and limited bandwidth.

Also, if the crawler/importer runs as a serverless function, any solution on top of the JVM or Python carries significant overhead.

Objective

As the task itself is rather simple and linear, I was looking for a solution with a very small footprint - ideally without depending on any runtime or interpreter in the first place.

So building on a small bash script cascading httpie/curl, jq and the native psql client like this sounds promising:

#!/bin/bash

# date of the batch to fetch, e.g. 2020-09-10
DATE=$(date +%Y-%m-%d)

# query the API, turn the JSON into an INSERT statement, feed it to PostgreSQL
http POST https://example.net/my/api/v2 date=$DATE | \
jq -r '<jq json filter>' | \
psql -U myuser -d mydb

Question

To do just that: how do I convert multi-level JSON including arrays into a prepared SQL statement using just jq?

Expected SQL statement:

INSERT INTO events (date,node,sensor,prop1,prop2,prop3,prop4) VALUES 
(2020-09-10,4170,1,0,-1,0,0),
(2020-09-10,4170,1,0,-1,300,0),
....
(2020-09-10,8888,2,0,-1,0,0)

JSON input:

{
  "date": "2020-09-10",
  "events": [
    {
      "sensor": 1,
      "intervals": [
        {
          "prop1": 0,
          "prop2": -1,
          "prop3": 0,
          "prop4": 0
        },
        {
          "prop1": 0,
          "prop2": -1,
          "prop3": 300,
          "prop4": 0
        }],
      "node": 4170
    },
    {
      "sensor": 2,
      "intervals": [
        {
          "prop1": 0,
          "prop2": -1,
          "prop3": 0,
          "prop4": 0
        }],
      "node": 8888
    }]
}
  • How many .events per http response object? – peak Jun 27 '21 at 08:15
  • Is there any reason you can't do this in SQL? You just pass the whole JSON string to a SQL statement. –  Jun 27 '21 at 09:23
  • It's unclear to me if you want to insert into three different tables, or just one. Could you please [edit] your question and add the `create table` statements for the tables in question? –  Jun 27 '21 at 09:27
  • Thanks for the feedback. I just reworked both question and answer. Maybe that already reflects your questions. – jenszo Jun 27 '21 at 09:58

1 Answer


I'm going to answer my own question here with what I've learned from people in #jq on Libera Chat, since I think it is well worth documenting for a broader audience.

Solution

Essentially, I've found two solutions.

The first, "elegant" one incorporates a two-fold join. This works with a smaller amount of data, but performance takes a hit due to its non-linear, two-pass join: 500,000 events take several minutes.

The second one results in linear, almost sequential processing, but is a lot less jq-like: 500,000 events take a couple of seconds on my machine.

a) The elegant way

jq -r '
("INSERT INTO events (date,node,sensor,prop1,prop2,prop3,prop4) VALUES "),
([ "(\(.date as $date | .events[] | [.node, .sensor] as $d | .intervals[] | [$date, $d[], values[]] | join(",")))"] | join(",\n")),
("ON CONFLICT DO NOTHING;")
'

with the resulting jq output:

INSERT INTO events (date,node,sensor,prop1,prop2,prop3,prop4) VALUES 
(2020-09-10,4170,1,0,-1,0,0),
(2020-09-10,4170,1,0,-1,300,0),
(2020-09-10,8888,2,0,-1,0,0)
ON CONFLICT DO NOTHING;

JQ Play: https://jqplay.org/s/oU8mZUHHTm

b) The sequential but less elegant way

jq -r '
("INSERT INTO events (date,node,sensor,prop1,prop2,prop3,prop4) VALUES "),
([.date as $date | .events[] | [.node, .sensor] as $d | .intervals[] | [ $date, $d[], values[] ]] | ( .[:-1][] | "(" + join(",") + "),"), (last | "(" + join(",") + ")")),
("ON CONFLICT DO NOTHING;")
'

JQ Play: https://jqplay.org/s/pkVxWbBwC3

Both suggested solutions go to some lengths to avoid a trailing , at the end of the SQL VALUES list.
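To plug either filter into the bash pipeline from the question, the filter can be kept in its own file and the generated statement fed straight into psql. A minimal sketch, assuming the filter from variant (a) is saved as events.jq and reusing the hypothetical connection parameters from above (depending on the column types, the date literal may additionally need quoting):

#!/bin/bash

DATE=$(date +%Y-%m-%d)

# fetch the day's events, build the INSERT ... VALUES statement with the
# filter from variant (a) stored in events.jq, and pipe it into psql
http POST https://example.net/my/api/v2 date=$DATE \
  | jq -r -f events.jq \
  | psql -U myuser -d mydb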

Why jq?

  • It is written in C, yielding a very small footprint, while offering quite advanced and well-proven capabilities for filtering and restructuring JSON
  • It works stream-based, so I'm expecting mostly sequential reads and writes (?)
  • The output can be adapted to the RDBMS used and to the application itself, which makes this fairly portable

JQ Github Page

Alternative Solutions (Discussion)

  1. If the RDBMS supports JSON natively (e.g. PostgreSQL), import the JSON directly.

Variations of this approach are described here on SO: (How can I import a JSON file into PostgreSQL?)

While Postgres' JSON parser is arguably well proven and quite efficient, this approach is not suitable for millions of rows at a time, imho. Without having conducted performance experiments, I'd argue that it might perform a little less efficiently, as the JSON blob is not indexed and the Postgres query planner will end up performing random seeks on it (I may be wrong), and there is an additional copy involved.
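For completeness, a minimal sketch of what such a direct import could look like. The psql variable doc and the use of jsonb_array_elements are my own assumptions, not something the question prescribes; the events table is the one from above, and the whole response is assumed to fit into memory:

#!/bin/bash

DATE=$(date +%Y-%m-%d)

# pull the whole response into one compact JSON document (assumption: it fits into memory)
DOC=$(http POST https://example.net/my/api/v2 date=$DATE | jq -c .)

# hand the document to PostgreSQL as a psql variable and let the server unnest it
psql -U myuser -d mydb -v doc="$DOC" <<'SQL'
INSERT INTO events (date, node, sensor, prop1, prop2, prop3, prop4)
SELECT (:'doc'::jsonb ->> 'date')::date,
       (e ->> 'node')::int,
       (e ->> 'sensor')::int,
       (i ->> 'prop1')::int,
       (i ->> 'prop2')::int,
       (i ->> 'prop3')::int,
       (i ->> 'prop4')::int
FROM jsonb_array_elements(:'doc'::jsonb -> 'events') AS e,
     jsonb_array_elements(e -> 'intervals') AS i
ON CONFLICT DO NOTHING;
SQL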

  2. Create CSV from JSON and COPY it into the RDBMS

Most likely this already involves jq to create the CSV file. Maybe Python's pandas module is of help here.

While this adds another layer of conversion to the matter, importing the CSV into the RDBMS is blazingly fast, as it is merely a raw COPY of the rows with the index built on the fly. This may be a good choice; yet, as opposed to the previous approaches, a COPY does not honour any TRIGGER configured on an existing table that would act only on INSERT/UPDATE.

If you want to filter out duplicates in a time series during import using ... ON CONFLICT DO NOTHING, this is not supported by COPY without an intermediate copy of the dataset: How Postgresql COPY TO STDIN With CSV do on conflic do update?
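For illustration, a minimal sketch of this variant without any conflict handling. The flattening step reuses the filter logic from variant (b), @csv takes care of quoting, and psql's \copy streams the rows from stdin into the events table from above (a sketch under those assumptions, not a drop-in solution):

#!/bin/bash

DATE=$(date +%Y-%m-%d)

# flatten the JSON to CSV rows (date,node,sensor,prop1..prop4) and bulk-load them;
# note: no ON CONFLICT handling is possible on this path
http POST https://example.net/my/api/v2 date=$DATE \
  | jq -r '.date as $date | .events[] | [.node, .sensor] as $d | .intervals[]
           | [$date, $d[], values[]] | @csv' \
  | psql -U myuser -d mydb \
      -c "\copy events (date,node,sensor,prop1,prop2,prop3,prop4) from stdin with (format csv)"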
