
I am trying to speed up the querying of some JSON data stored inside a PostgreSQL database. I inherited an application that queries a PostgreSQL table called data with a column called value, where value is a blob of JSON of type jsonb.

The table only has about 300 rows, but it takes 12 seconds to select these 5 JSON elements from it. The JSON blobs are a bit large, but the data I need is all at the top level of the JSON nesting, if that helps.

I tried adding an index with CREATE INDEX idx_tbl_data ON data USING gin (value); but that didn't help. Is there a different index I should be using? The long-term vision is to rewrite the application to move that data out of the JSON, but that is at least 30-40 man-days of work due to complexity in other parts of the application, so I am looking to see if I can make this faster in the short term.

Not sure if it helps but the underlying data that makes up this result set doesn't change often. It's the data that is further down in the json blob that often changes.

SELECT
  value::json ->> 'name' AS name,
  value::json ->> 'mnemonic' AS mnemonic,
  value::json ->> 'urlName' AS "urlName",
  value::json ->> 'countryCode' AS "countryCode",
  value::json #>>'{team}' AS team
FROM
  data;
  • You are retrieving all rows of the table. No index will speed that up. A properly normalized data model would possibly be faster, though – a_horse_with_no_name Feb 11 '22 at 20:37
  • It seems to be extra slow because it is getting the data out of the json. The amount of data being returned is not likely the reason for it being slow. If I had those 5 elements in the table in their own columns, it would likely take milliseconds to grab 300 records. The problem is that those 5 'columns' are inside the json. I was hoping the index on the json would help it find the values within the blob faster – personalt Feb 11 '22 at 20:53
  • Normalize your data model and everything will be more efficient. To validate the theory about slow retrieval, look at the plan generated using `explain (analyze, buffers, timing)`, ideally after setting `track_io_timing` to `on` (a minimal example follows after these comments) – a_horse_with_no_name Feb 11 '22 at 20:55
  • @a_horse_with_no_name - I didn't design this very poor data model. We do want to refactor this, but the genius that came up with this model did it in the first 1000 lines of application code and then built 300 man-days of application on top of it. So I was just looking to see if anything can be done here to get it to access the blobs faster in the short term – personalt Feb 11 '22 at 21:21
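Following that suggestion from the comments, a minimal way to capture such a plan for the query in question might look like this (a sketch; changing track_io_timing requires superuser rights, otherwise set it in postgresql.conf):

SET track_io_timing = on;  -- superuser only

EXPLAIN (ANALYZE, BUFFERS, TIMING)
SELECT
  value::json ->> 'name' AS name,
  value::json ->> 'mnemonic' AS mnemonic,
  value::json ->> 'urlName' AS "urlName",
  value::json ->> 'countryCode' AS "countryCode",
  value::json #>> '{team}' AS team
FROM data;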

2 Answers


Like a_horse already advised (and you mentioned yourself), the proper fix is to extract those attributes to separate columns, normalizing your design to some extent.

Can an index help?

Sadly, no (as of Postgres 14).

It could work in theory. Since your values are big, an expression index with just some small attributes can be picked up by Postgres in an index-only scan, even when retrieving all rows (where it otherwise would ignore indexes).

The manual:

However, PostgreSQL's planner is currently not very smart about such cases. It considers a query to be potentially executable by index-only scan only when all columns needed by the query are available from the index.

So you would have to include value itself in the index, even if just as an INCLUDE column - totally spoiling the whole idea. No go.
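For illustration only: the kind of expression index that would theoretically cover the query looks something like the sketch below (the index name is made up). Per the above, the Postgres 14 planner still won't use it for an index-only scan on this query.

CREATE INDEX idx_data_value_attrs ON data (
  (value ->> 'name')
, (value ->> 'mnemonic')
, (value ->> 'urlName')
, (value ->> 'countryCode')
, (value #>> '{team}')
);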

You probably can still do something in the short term. Two crucial quotes:

I am looking to see if I can make this faster in short term

The json blobs are a bit large

Data type

Drop the cast to json from the query. The column is already jsonb; casting every row to json adds pointless cost.
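With the cast removed, the query from the question simply becomes:

SELECT
  value ->> 'name' AS name,
  value ->> 'mnemonic' AS mnemonic,
  value ->> 'urlName' AS "urlName",
  value ->> 'countryCode' AS "countryCode",
  value #>> '{team}' AS team
FROM data;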

Compression

One major cost factor will be compression. Postgres has to "de-toast" the whole large column, just to extract some small attributes. Since Postgres 14, you can switch the compression algorithm (if support is enabled in your version!). The default is dictated by the config setting default_toast_compression, which is set to pglz by default. Currently the only available alternative is lz4. You can set that per column. Any time.

LZ4 (lz4) is considerably faster, while compressing typically a bit less. About twice as fast, but around 10 % more storage (depends!). If performance is not an issue it's best to stick to the stronger compression of the default LZ algorithm (pglz). There may be more compression algorithms to pick from in the future.

To implement:

ALTER TABLE data
  ALTER COLUMN value SET COMPRESSION lz4;
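If you prefer, the default mentioned above can also be changed globally or per session instead of per column (a sketch; assumes Postgres 14+ built with lz4 support):

ALTER SYSTEM SET default_toast_compression = 'lz4';  -- instance-wide, needs superuser
SELECT pg_reload_conf();                             -- apply the changed setting

SET default_toast_compression = 'lz4';               -- or just for the current session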

Setting a new COMPRESSION for a column does not re-compress automatically. Postgres remembers the compression method and only re-compresses if it's forced to un-compress anyway. You may want to force re-compression of existing values. You can check with:

SELECT pg_column_compression(value) FROM data LIMIT 10;
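To actually force re-compression of existing values, one option (not from the answer above, so treat it as a sketch and test on a copy first) is to rewrite each value; casting through text makes Postgres de-toast and store the value again with the column's current compression setting:

UPDATE data
SET    value = value::text::jsonb
WHERE  pg_column_compression(value) = 'pglz';  -- only rows still using the old compression

VACUUM FULL data;  -- optional: reclaim the space left behind by the rewrite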


GENERATED columns

While stuck with the broken design you might just add some (small!) generated columns to cover your query:

ALTER TABLE data
  ADD COLUMN name text GENERATED ALWAYS AS (value ->> 'name') STORED
, ADD COLUMN mnemonic text GENERATED ALWAYS AS (value ->> 'mnemonic') STORED
...

And then target only those generated columns, not involving the big value at all.

SELECT name, mnemonic, ... FROM data;

That would bypass the main performance issue.
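Spelled out for all five attributes from the question, the DDL might look like this (a sketch; text as the data type for each column is an assumption):

ALTER TABLE data
  ADD COLUMN name text GENERATED ALWAYS AS (value ->> 'name') STORED
, ADD COLUMN mnemonic text GENERATED ALWAYS AS (value ->> 'mnemonic') STORED
, ADD COLUMN "urlName" text GENERATED ALWAYS AS (value ->> 'urlName') STORED
, ADD COLUMN "countryCode" text GENERATED ALWAYS AS (value ->> 'countryCode') STORED
, ADD COLUMN team text GENERATED ALWAYS AS (value #>> '{team}') STORED;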


However, you mentioned:

It's the data that is further down in the json blob that often changes.

Every change to value forces a re-check on the generated columns, so that adds write cost.

Erwin Brandstetter
  • The data is already in jsonb format. That is why I was surprised it was so slow: 13 seconds to pull 4 small text fields from 300 rows. But I can try the compression changes. Will the GENERATED ALWAYS columns refresh when the relevant parts of the json change? That could help me. – personalt Feb 12 '22 at 01:23
  • OK, you mentioned jsonb. But why cast to `json` in the query? Yes, generated columns are kept up to date automatically (at some cost). – Erwin Brandstetter Feb 12 '22 at 01:34
  • @personalt Yes, the fact that it is already jsonb is why it is so pointless and harmful to cast it to the much slower json. – jjanes Feb 12 '22 at 03:35
  • @ErwinBrandstetter - I overlooked that the query was casting to json. Even though I had written similar queries, someone else wrote this one, and I kept reading it as 'pull the data from the json'. I changed it to the query below and it sped up to less than a second. I will have to see how the rest of the join works, but this was helpful. SELECT (data.value ->> 'name'::text) AS name, (data.value ->> 'mnemonic'::text) AS mnemonic, (data.value ->> 'urlName'::text) AS "urlName", (data.value ->> 'countryCode'::text) AS "countryCode" FROM data – personalt Feb 13 '22 at 01:07

I am also having a similar performance issue; unfortunately, I am not running PostgreSQL 12+, so I cannot use generated columns.

You have two solutions:

1.) Give the table a dedicated, hardcoded column for each attribute and have the parser insert into those columns. (Ideal solution)

2.) Make a materialized view of the query and add an index on the primary key column of this view. You can then use REFRESH MATERIALIZED VIEW CONCURRENTLY to refresh the view without impacting other users/applications that are querying it (a sketch follows below).
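A sketch of option 2 (the view name data_mv and the assumption that the data table has a primary key column named id are mine, not from the original):

CREATE MATERIALIZED VIEW data_mv AS
SELECT
  id,                                     -- assumed primary key of data
  value ->> 'name'        AS name,
  value ->> 'mnemonic'    AS mnemonic,
  value ->> 'urlName'     AS "urlName",
  value ->> 'countryCode' AS "countryCode",
  value #>> '{team}'      AS team
FROM data;

CREATE UNIQUE INDEX ON data_mv (id);      -- required for REFRESH ... CONCURRENTLY

REFRESH MATERIALIZED VIEW CONCURRENTLY data_mv;  -- refresh without blocking readers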

rup