Pattern-matching for JSON values

Question

Running a local docker instance of Postgres 12.5 (with 4MB work_mem). I'm implementing this pattern to search arbitrary fields in json. Goal is to search and return the JSON column profile quickly.

I tried an EXISTS subquery on a materialized view:

CREATE TABLE end_user (
    id varchar NOT NULL,
    environment_id varchar NOT NULL,
    profile jsonb NOT NULL DEFAULT '{}'::jsonb,
    CONSTRAINT end_user_pkey PRIMARY KEY (environment_id, id)
);

CREATE INDEX end_user_environment_id_idx ON private.end_user USING btree (environment_id);
CREATE INDEX end_user_id_idx ON private.end_user USING btree (id);
CREATE INDEX end_user_profile_idx ON private.end_user USING gin (profile);

CREATE MATERIALIZED VIEW user_profiles AS
SELECT u.environment_id, u.id, j.key, j.value 
FROM  end_user u, jsonb_each_text(u.profile) j(key, value);

CREATE UNIQUE INDEX on user_profiles (environment_id, id, key); 
CREATE INDEX user_profile_trgm_idx ON user_profiles using gin (value gin_trgm_ops);

I have this query which is indexed correctly so that it executes in a few milliseconds over a million rows. ✅

select * from user_profiles 
where value ilike '%auckland%' and key = 'timezone' and environment_id = 'test';

Exec time 42ms

Bitmap Heap Scan on user_profiles  (cost=28935.65..62591.44 rows=9659 width=65)                  
  Recheck Cond: ((value ~~* '%auckland%'::text) AND (key = 'timezone'::text))                    
  Filter: ((environment_id)::text = 'test'::text)                                              
  ->  BitmapAnd  (cost=28935.65..28935.65 rows=9659 width=0)                                     
        ->  Bitmap Index Scan on user_profile_trgm_idx  (cost=0.00..2923.95 rows=320526 width=0) 
              Index Cond: (value ~~* '%auckland%'::text)                                         
        ->  Bitmap Index Scan on user_profiles_key_idx  (cost=0.00..26006.62 rows=994408 width=0)
              Index Cond: (key = 'timezone'::text)

However if I use it with an exists query in order to build up conditions like so:

select * users u
where 
   environment_id = 'test'
and exists (
    select 1 from user_profiles p
    where 
       value ilike '%auckland%' 
       and key = 'timezone'
       and p.id = u.id
       and environment_id = 'test'   
)

It performs very slowly.

Exec time 17.44 Seconds

Nested Loop  (cost=62616.01..124606.45 rows=9658 width=1459) (actual time=19206.818..28444.491 rows=332572 loops=1)                                            
  Buffers: shared hit=952734 read=624101                                                                                                                       
  ->  HashAggregate  (cost=62615.59..62707.52 rows=9193 width=15) (actual time=19205.238..19292.998 rows=332572 loops=1)                                       
        Group Key: (p.id)::text                                                                                                                                
        Buffers: shared hit=373 read=246174                                                                                                                    
        ->  Bitmap Heap Scan on user_profiles p  (cost=28935.65..62591.44 rows=9659 width=15) (actual time=278.211..18942.629 rows=332572 loops=1)             
              Recheck Cond: ((value ~~* '%auckland%'::text) AND (key = 'timezone'::text))                                                                      
              Rows Removed by Index Recheck: 17781109                                                                                                          
              Filter: ((environment_id)::text = 'test'::text)                                                                                                
              Heap Blocks: exact=43928 lossy=197955                                                                                                            
              Buffers: shared hit=373 read=246174                                                                                                              
              ->  BitmapAnd  (cost=28935.65..28935.65 rows=9659 width=0) (actual time=272.626..272.629 rows=0 loops=1)                                         
                    Buffers: shared hit=373 read=4291                                                                                                          
                    ->  Bitmap Index Scan on user_profile_trgm_idx  (cost=0.00..2923.95 rows=320526 width=0) (actual time=177.577..177.577 rows=332572 loops=1)
                          Index Cond: (value ~~* '%auckland%'::text)                                                                                           
                          Buffers: shared hit=373 read=455                                                                                                     
                    ->  Bitmap Index Scan on user_profiles_key_idx  (cost=0.00..26006.62 rows=994408 width=0) (actual time=92.586..92.589 rows=1000000 loops=1)
                          Index Cond: (key = 'timezone'::text)                                                                                                 
                          Buffers: shared read=3836                                                                                                            
  ->  Index Scan using end_user_id_idx on end_user u  (cost=0.42..6.79 rows=1 width=1459) (actual time=0.027..0.027 rows=1 loops=332572)                       
        Index Cond: ((id)::text = (p.id)::text)                                                                                                                
        Filter: ((environment_id)::text = 'test'::text)                                                                                                      
        Buffers: shared hit=952361 read=377927                                                                                                                 
Planning Time: 19.002 ms                                                                                                                                       
Execution Time: 28497.570 ms

This is a shame as the exists would be handy if it was fast as I would be able to add more conditions dynamically in my application code with extra conditions represented as additional exists clauses.

BTW a lateral join does speed this up, but I don't understand how I have such a big difference:

select * from users u,
lateral (
    select id from user_profiles p
    where 
        value ilike '%auckland%' 
        and key = 'timezone' 
        and environment_id = u.environment_id 
        and p.id = u.id
   ) ss
where u.environment_id = 'test';

Exec time 304ms

Gather  (cost=29936.07..91577.38 rows=9658 width=1474) (actual time=1100.824..15430.620 rows=332572 loops=1)                                                     
  Workers Planned: 2                                                                                                                                             
  Workers Launched: 2                                                                                                                                            
  Buffers: shared hit=1140551 read=436286                                                                                                                        
  ->  Nested Loop  (cost=28936.07..89611.58 rows=4024 width=1474) (actual time=602.490..14805.285 rows=110857 loops=3)                                           
        Buffers: shared hit=1140551 read=436286                                                                                                                  
        ->  Parallel Bitmap Heap Scan on user_profiles p  (cost=28935.65..62492.84 rows=4025 width=22) (actual time=602.078..12247.891 rows=110857 loops=3)      
              Recheck Cond: ((value ~~* '%auckland%'::text) AND (key = 'timezone'::text))                                                                        
              Rows Removed by Index Recheck: 5927036                                                                                                             
              Filter: ((environment_id)::text = 'test'::text)                                                                                                  
              Heap Blocks: exact=14659 lossy=65588                                                                                                               
              Buffers: shared hit=373 read=246174                                                                                                                
              ->  BitmapAnd  (cost=28935.65..28935.65 rows=9659 width=0) (actual time=1087.258..1087.259 rows=0 loops=1)                                         
                    Buffers: shared hit=373 read=4291                                                                                                            
                    ->  Bitmap Index Scan on user_profile_trgm_idx  (cost=0.00..2923.95 rows=320526 width=0) (actual time=853.075..853.076 rows=332572 loops=1)  
                          Index Cond: (value ~~* '%auckland%'::text)                                                                                             
                          Buffers: shared hit=373 read=455                                                                                                       
                    ->  Bitmap Index Scan on user_profiles_key_idx  (cost=0.00..26006.62 rows=994408 width=0) (actual time=231.295..231.295 rows=1000000 loops=1)
                          Index Cond: (key = 'timezone'::text)                                                                                                   
                          Buffers: shared read=3836                                                                                                              
        ->  Index Scan using end_user_id_idx on end_user u  (cost=0.42..6.74 rows=1 width=1459) (actual time=0.022..0.022 rows=1 loops=332572)                   
              Index Cond: ((id)::text = (p.id)::text)                                                                                                            
              Filter: ((environment_id)::text = 'test'::text)                                                                                                  
              Buffers: shared hit=1140178 read=190112                                                                                                            
Planning Time: 16.877 ms                                                                                                                                         
Execution Time: 15461.571 ms

Keen to hear any thoughts on why the exists subquery is so slow, and any other options I could look at here.

Distinct counts as requested by Erwin, note this is dummy data to test load but its reasonably close to production ratios

select count(distinct environment_id)  => 4 
     , count(distinct key)             => 33
     , count(distinct value)           => 15M  
from private.user_profiles;

Update after increasing working memory to 16MB as suggested by Erwin:

ALTER SYSTEM SET work_mem to '16MB'; SELECT pg_reload_conf();

Things are looking better with a 500ms execution time for the exists query. Which now explains like.

Gather  (cost=3926.79..400754.43 rows=9658 width=1459) (actual time=312.213..9396.610 rows=332572 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  Buffers: shared hit=1141083 read=431918
  ->  Nested Loop  (cost=2926.79..398788.63 rows=4024 width=1459) (actual time=155.271..8987.721 rows=110857 loops=3)
        Buffers: shared hit=1141083 read=431918
        ->  Parallel Bitmap Heap Scan on user_profiles p  (cost=2926.36..371669.88 rows=4025 width=15) (actual time=150.989..2962.870 rows=110857 loops=
              Recheck Cond: (value ~~* '%auckland%'::text)
              Filter: (((environment_id)::text = 'test'::text) AND (key = 'timezone'::text))
              Heap Blocks: exact=82556
              Buffers: shared hit=981 read=241730
              ->  Bitmap Index Scan on user_profile_trgm_idx  (cost=0.00..2923.95 rows=320526 width=0) (actual time=243.604..243.605 rows=332572 loops=1
                    Index Cond: (value ~~* '%auckland%'::text)
                    Buffers: shared hit=828
        ->  Index Scan using end_user_id_idx on end_user u  (cost=0.42..6.74 rows=1 width=1459) (actual time=0.054..0.054 rows=1 loops=332572)
              Index Cond: ((id)::text = (p.id)::text)
              Filter: ((environment_id)::text = 'test'::text)
              Buffers: shared hit=1140102 read=190188
Planning Time: 9.932 ms
Execution Time: 9427.067 ms

Erwin Brandstetter · Accepted Answer · 2023-08-01T13:54:08.623

SQL/JSON in Postgres 12 or later

My answer you have been working off is dated. The current Postgres version was 9.4 in July 2015.

In Postgres 12 the whole design can be radically simpler with regular expressions in an SQL/JSON path expression. The manual:

SQL/JSON path expressions allow matching text to a regular expression with the like_regex filter.

With index-support, too. Scrap the materialized view. All we need is your original table and an index like:

CREATE INDEX end_user_path_ops_idx ON end_user USING GIN (profile jsonb_path_ops);

This query is equivalent to your original and can use the index:

SELECT *
FROM   end_user u
WHERE  environment_id = 'test'
AND    profile @? '$.timezone ? (@ like_regex "auck" flag "i")';

db<>fiddle here

The one drawback is that the SQL/JSON path language needs getting used to.
Further reading:

Server configuration

A more basic problem becomes obvious in this line of the EXPLAIN output:

Heap Blocks: exact=14659 lossy=65588

lossy indicates that you don't have enough work_mem. Your setting is obviously very low. (The default setting of 4 MB is way to low for a database involving tables with millions of rows.) See:

Chances are, you need to do more in the department of server configuration. And you seem to be limited on RAM in general. I see high "read" counts, which indicates cold cache and/or a lack or mis-configuration of cache memory.

This Postgres Wiki page can help you get started.

@Tim: *No* index on `(profile)` is applicable for the expression `eu.profile->>'timezone' ilike '%auckland%'`. A trigram expression index would: `CREATE INDEX ON end_user USING GIN ((eu.profile->>'timezone') gin_trgm_ops)`. See: https://stackoverflow.com/a/36489851/939860, https://stackoverflow.com/a/13452528/939860. Faster for that particular expression. But the suggested approach with JSON path language and a `jsonb_path_ops` GIN index covers the complete JSON column, not just a single key. — Erwin Brandstetter, Jul 03 '21 at 02:52
I've seen two ways to do this: `jsonb_path_exists(profile, '$.** ? (@ like_regex "auck" flag "i")')` and `profile @? '$.** ? (@ like_regex "auck" flag "i")'`. Do you know if both will get the same results? — Marius Lian, Sep 20 '21 at 09:44

Pattern-matching for JSON values

1 Answers1

SQL/JSON in Postgres 12 or later

Server configuration

Linked