4

Suppose I have the following parents table:

create table parents (
  id       integer not null constraint parents_pkey primary key,
  name     text    not null,
  children jsonb   not null
);

Where children is a json array of the following structure:

[
    {
        "name": "child1",
        "age": 10
    }, 
    {
        "name": "child2",
        "age": 12
    } 
]

And I need, for example, to get all parents that have children with age between 10 and 12.

I create the following query:

select distinct
  p.*
from
  parents p, jsonb_array_elements(p.children) c
where
  (c->>'age')::int between 10 and 12;

It works well but very slowly when the table parents is big (for example 1M records). I tried to use 'gin' index on children field but this did not help.

So is there a way to speed up such queries? Or maybe there is another solution to make queries/indexes against fields in the nested json arrays?

Query plan:

Unique  (cost=1793091.18..1803091.18 rows=1000000 width=306) (actual time=4070.866..5106.998 rows=399947 loops=1)
  ->  Sort  (cost=1793091.18..1795591.18 rows=1000000 width=306) (actual time=4070.864..4836.241 rows=497313 loops=1)
        Sort Key: p.id, p.children, p.name
        Sort Method: external merge  Disk: 186040kB
        ->  Gather  (cost=1000.00..1406321.34 rows=1000000 width=306) (actual time=0.892..1354.147 rows=497313 loops=1)
              Workers Planned: 2
              Workers Launched: 2
              ->  Nested Loop  (cost=0.00..1305321.34 rows=416667 width=306) (actual time=0.162..1794.134 rows=165771 loops=3)
                    ->  Parallel Seq Scan on parents p  (cost=0.00..51153.67 rows=416667 width=306) (actual time=0.075..239.786 rows=333333 loops=3)
                    ->  Function Scan on jsonb_array_elements c  (cost=0.00..3.00 rows=1 width=0) (actual time=0.004..0.004 rows=0 loops=1000000)
                          Filter: ((((value ->> 'age'::text))::integer >= 10) AND (((value ->> 'age'::text))::integer <= 12))
                          Rows Removed by Filter: 3
Planning time: 0.218 ms
Execution time: 5140.277 ms
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
Cepr0
  • 28,144
  • 8
  • 75
  • 101
  • 2
    The only way to radically speed up querying these values is to normalize the model. `Children` are crying to be a table. – klin Mar 28 '18 at 12:52
  • @klin I know. It is a table now, but it want be a json array )) – Cepr0 Mar 28 '18 at 13:25
  • Oh no, don't do it to them! – klin Mar 28 '18 at 13:33
  • I, too, think the poor children deserve their own table. You can always add a `VIEW` returning the JSON representation. Should be fast and simple. But there *are* ways to help the poor children in the crowded `jsonb` value, too ... – Erwin Brandstetter Mar 28 '18 at 17:31
  • @ErwinBrandstetter You are giving them the hope! Children love their parents very much, let's help them stay with their parents! ) – Cepr0 Mar 28 '18 at 18:13

2 Answers2

5

A first immediate measure would be to make the query you have a bit faster:

SELECT *
FROM   parents p
WHERE  EXISTS (
   SELECT FROM jsonb_array_elements(p.children) c
   WHERE (c->>'age')::int BETWEEN 10 AND 12
   );

The EXISTS semi-join avoids duplication of rows in the intermediate table when multiple array objects match - and the need for DISTINCT ON in the outer query. But that's only mildly faster, yet.

The core problem is that you want to test for a range of integer values, while existing jsonb operators do not provide such functionality.

There are various ways around this. Not knowing any of this, here is a "smart" solution that solves the given example. The trick is to split the range to distinct values and use the jsonb containment operator @>:

SELECT *
FROM   parents p
WHERE (p.children @> '[{"age": 10}]'
OR     p.children @> '[{"age": 11}]'
OR     p.children @> '[{"age": 12}]');

Supported by a jsonb_path_ops GIN index:

CREATE INDEX parents_children_gin_idx ON parents USING gin (children jsonb_path_ops);

But if your ranges span more than a hand full of integer values, you'll need something more generic. As always, the best solution depends on the complete situation: Data distribution, value frequencies, typical ranges in queries, NULL values possible?, row size, read/write patterns, does every jsonb value have one or more matching age key? ...

Related answer with specialized, very fast index:

Related:

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • Thank you very mach Erwin, it's brilliantly as always! Your solution with 'exists' gave more than 2 times acceleration according to the query plan, but the real query with 'limit 20' have got the data in only 45 ms - and this is on my slow home computer! – Cepr0 Mar 28 '18 at 19:31
  • 1
    @Cepr0: `LIMIT` is a game changer and can generally favor completely different query plans. The fact that Postgres has no real statistics for nested values *inside* a `jsonb` column can lead to suboptimal query plans for non-standard distributions. Depending on the complete situation there are still faster solutions for the query without `LIMIT`, too ... – Erwin Brandstetter Mar 28 '18 at 21:31
  • I do not have a concrete example now - we are considering the possibility to simplify the structure of the database. We are working with the Spring/Hibernate/PostgreSQL stack to build REST services and are thinking to move all dependent (embedded) entities to the table of the main/parent entity (aggregate root) in order to optimize/simplify the retrieving data with pagination. But without loss of functionality (data filtering on nested entities). If you can give any suggestion, I will be very grateful! – Cepr0 Mar 29 '18 at 11:10
  • @Cepr0: Not sure how to advise there. This related answer might be useful for pagination: https://stackoverflow.com/a/34291099/939860. Good luck with the migration! – Erwin Brandstetter Mar 29 '18 at 12:12
  • @Cepr0: you may be interested in the added link to a related answer. – Erwin Brandstetter Apr 01 '18 at 02:36
  • Thank you, Erwin, again! I seen almost all your significant answers on this topic.. Your help is very important for us! – Cepr0 Apr 03 '18 at 12:55
0

I suggest you try this way (this is from my experience).

WITH t AS (SELECT id, jsonb_array_elements(children) as child_data FROM parents)
SELECT *  
  FROM parents 
 WHERE id IN (
              SELECT id
                FROM t
               WHERE (child_data->>'age')::int between 10 and 12
           )

Hopefully it works.

Mabu Kloesen
  • 1,248
  • 7
  • 8