0

I have a TEXT keyvalues column in Postgres:

select * from test5 limit 5;

 id |                      keyvalues
----+------------------------------------------------------
  1 | ^ first 1 | second 3
  2 | ^ first 1 | second 2 ^ first 2 | second 3
  3 | ^ first 1 | second 2 | second 3
  4 | ^ first 2 | second 3 ^ first 1 | second 2 | second 2
  5 | ^ first 2 | second 3 ^ first 1 | second 3

My queries must exclude the ^ character from the middle of the match, so I'm using regular expressions:

explain analyze select count(*) from test5 where keyvalues ~* '\^ first 1[^\^]+second 0';

                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=78383.31..78383.32 rows=1 width=8) (actual time=7332.030..7332.030 rows=1 loops=1)
   ->  Gather  (cost=78383.10..78383.30 rows=2 width=8) (actual time=7332.021..7337.138 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=77383.10..77383.10 rows=1 width=8) (actual time=7328.155..7328.156 rows=1 loops=3)
               ->  Parallel Seq Scan on test5  (cost=0.00..77382.50 rows=238 width=0) (actual time=7328.146..7328.146 rows=0 loops=3)
                     Filter: (keyvalues ~* '\^ first 1[^\^]+second 0'::text)
                     Rows Removed by Filter: 1666668
 Planning Time: 0.068 ms
 Execution Time: 7337.184 ms

The query works (zero rows match), but is way too slow at > 7 seconds.

I thought indexing with trigrams would help, but no luck:

create extension if not exists pg_trgm;
create index on test5 using gin (keyvalues gin_trgm_ops);

explain analyze select count(*) from test5 where keyvalues ~* '\^ first 1[^\^]+second 0';
                                                                   QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1484.02..1484.03 rows=1 width=8) (actual time=23734.646..23734.646 rows=1 loops=1)
   ->  Bitmap Heap Scan on test5  (cost=1480.00..1484.01 rows=1 width=0) (actual time=23734.641..23734.641 rows=0 loops=1)
         Recheck Cond: (keyvalues ~* '\^ first 1[^\^]+second 0'::text)
         Rows Removed by Index Recheck: 5000005
         Heap Blocks: exact=47620
         ->  Bitmap Index Scan on test5_keyvalues_idx  (cost=0.00..1480.00 rows=1 width=0) (actual time=1756.158..1756.158 rows=5000005 loops=1)
               Index Cond: (keyvalues ~* '\^ first 1[^\^]+second 0'::text)
 Planning Time: 0.412 ms
 Execution Time: 23734.722 ms

The query with the trigram index is 3x slower! It still returns the correct result (zero rows). I expected the trigram index to figure out immediately there's no second 0 string anywhere, and be super fast.

(Motivation: I want to avoid normalizing the keyvalues into another table, so I'm looking to encode the matching logic in a single TEXT field using text indexing and regexps instead. The logic works, but is too slow, as is JSONB.)

user124114
  • 8,372
  • 11
  • 41
  • 63
  • You really should be storing those as individual records in a table instead of trying to combine them like that. Or at the very least, doesn't postgres support array types? – Shawn Jun 08 '19 at 11:25
  • @Shawn No, Postgres doesn't support efficient `ILIKE` over TEXT arrays, only via the slow `unnest()` (sequential scan) AFAIK. But if you know a way, let me know! – user124114 Jun 08 '19 at 11:29
  • Then try (only for a test purpose) to normalize this table and check how it will perform. I will not be surprised if it will be faster than yoy solution with regexp. – krokodilko Jun 08 '19 at 17:16
  • @krokodilko But that would defeat the purpose of this question. Btw you can see such results in the hyperlinks, if you like. – user124114 Jun 08 '19 at 18:35

2 Answers2

2

As you saw, this will not work well with trigrams. Trigrams do not match over the space boundary, so if all of your data contain the same words, the index will match every row.

This may make things more clear:

with data as (select * from (values ('^ first 1 | second 3'), 
                                    ('^ first 1 | second 2 ^ first 2 | second 3'), 
                                    ('^ first 1 | second 2 | second 3'), 
                                    ('^ first 2 | second 3 ^ first 1 | second 2 | second 2'), 
                                    ('^ first 2 | second 3 ^ first 1 | second 3')
                             ) v(keyvalues)
)
select keyvalues, show_trgm(keyvalues) from data;
                      keyvalues                       |                                               show_trgm
------------------------------------------------------+-------------------------------------------------------------------------------------------------------
 ^ first 1 | second 3                                 | {"  1","  3","  f","  s"," 1 "," 3 "," fi"," se",con,eco,fir,irs,"nd ",ond,rst,sec,"st "}
 ^ first 1 | second 2 ^ first 2 | second 3            | {"  1","  2","  3","  f","  s"," 1 "," 2 "," 3 "," fi"," se",con,eco,fir,irs,"nd ",ond,rst,sec,"st "}
 ^ first 1 | second 2 | second 3                      | {"  1","  2","  3","  f","  s"," 1 "," 2 "," 3 "," fi"," se",con,eco,fir,irs,"nd ",ond,rst,sec,"st "}
 ^ first 2 | second 3 ^ first 1 | second 2 | second 2 | {"  1","  2","  3","  f","  s"," 1 "," 2 "," 3 "," fi"," se",con,eco,fir,irs,"nd ",ond,rst,sec,"st "}
 ^ first 2 | second 3 ^ first 1 | second 3            | {"  1","  2","  3","  f","  s"," 1 "," 2 "," 3 "," fi"," se",con,eco,fir,irs,"nd ",ond,rst,sec,"st "}

Could you use a partial index to exclude the rows with ^ in the middle?

Jeremy
  • 6,313
  • 17
  • 20
2

According to the OP, the correct answer was given here on DBA.SE by user @jjanes:

I expected the trigram index to figure out immediately there's no second 0 string anywhere

'second' and '0' are separate words, so it cannot detect their joint absence as such. It seems like it could detect the absence of ' 0', but this comment from "contrib/pg_trgm/trgm_regexp.c" seems pertinent:

     * Note: Using again the example "foo bar", we will not consider the
     * trigram "  b", though this trigram would be found by the trigram
     * extraction code.  Since we will find " ba", it doesn't seem worth
     * trying to hack the algorithm to generate the additional trigram.

Since 0 is the last character in the pattern string, there will be no trigram of the form " 0a", either, so it just misses that opportunity.

Even if it were not for this limitation, your approach seems extremely fragile.

Glorfindel
  • 21,988
  • 13
  • 81
  • 109