
Let's say we need to check whether a jsonb column contains a particular value, matching by substring against any of its values (non-nested, first level only).

How does one effectively optimize a query that searches an entire JSONB column (that is, every key) for a value?

Is there a good alternative to doing ILIKE '%val%' on the jsonb datatype cast to text?

jsonb_each_text(jsonb_column) ILIKE '%val%'
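
Spelled out, that workaround looks roughly like this (a sketch only; `my_table` and `jsonb_column` are placeholder names):

-- return rows where any first-level value matches the pattern
SELECT t.*
FROM my_table t
WHERE EXISTS (
  SELECT 1
  FROM jsonb_each_text(t.jsonb_column) AS e(key, value)  -- unpack first-level key/value pairs
  WHERE e.value ILIKE '%val%'
);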

As an example consider this data:

SELECT 
  '{
   "col1": "somevalue", 
   "col2": 5.5, 
   "col3": 2016-01-01, 
   "col4": "othervalue", 
   "col5": "yet_another_value"
  }'::JSONB

How would you go about optimizing such a query when you need to search for the pattern %val% across rows whose jsonb column has a different set of keys from row to row?

I'm aware that searching with both a leading and a trailing % sign is inefficient, hence I'm looking for a better way but having a hard time finding one. Also, explicitly indexing every field within the json column is not an option, since the fields vary by record type and this would create a huge set of indexes (not every row has the same set of keys).

Question

Is there a better alternative to extracting each key-value pair to text and performing an ILIKE/POSIX search?

Kamil Gosciminski
  • This may be a better fit for dba.stackexchange.com; I just wanted to reach a wider audience with this matter. – Kamil Gosciminski Apr 27 '16 at 16:56
  • [`pg_trgm`](http://www.postgresql.org/docs/current/static/pgtrgm.html) may be the best option (ILIKE/POSIX-style matching) for that, as you would still be using pattern-matching criteria on the jsonb column (a rough sketch of this follows these comments) – Dmitry S Apr 27 '16 at 17:23
  • @DmitrySavinkov could you please elaborate? I believe that I would still need to unpack the json data into separate rows. – Kamil Gosciminski Apr 27 '16 at 18:03
  • yes, you need to unpack the value so the `gin_trgm_ops` operator class can be applied; you can also check this [answer](http://stackoverflow.com/a/32959762/3961156) – Dmitry S Apr 27 '16 at 18:26
  • Filters like `something LIKE '%%'` are inefficient by default because they always cause a full scan of the data. So @DmitrySavinkov's suggestion is almost the best solution. IMO it should be the answer, with a brief explanation. – Abelisto Apr 27 '16 at 19:27
  • My question doesn't seem to make clear what I'm looking for. I'm aware of pg_trgm, of the inefficiency of LIKE with a leading and trailing `%`, etc. What I am mainly asking for is a way to search every jsonb key by its value efficiently (without unpacking it explicitly into text) – Kamil Gosciminski Apr 27 '16 at 23:54
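
For reference, the approach suggested in the comments above could look roughly like this. This is only a sketch: `my_table` and `jsonb_column` are placeholder names, and note that casting the whole jsonb to text means the pattern is matched against keys as well as values:

-- enable trigram matching and index the textual form of the whole jsonb value
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX my_table_jsonb_trgm_idx ON my_table USING gin ((jsonb_column::text) gin_trgm_ops);

-- an ILIKE '%val%' filter over the same expression can now use the trigram index
SELECT * FROM my_table WHERE jsonb_column::text ILIKE '%val%';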

1 Answer


If you know you will need to query only a few known keys, then you can simply index those expressions.

This is an oversimplified but self-explanatory example:

create table foo as SELECT '{"col1": "somevalue", "col2": 5.5, "col3": "2016-01-01", "col4": "othervalue", "col5": "yet_another_value"}'::JSONB as bar;

create index pickfoo1 on foo ((bar #>> '{col1}'));
create index pickfoo2 on foo ((bar #>> '{col2}'));
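
An equality lookup on the same expression can then use those indexes, for example:

select * from foo where bar #>> '{col1}' = 'somevalue';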

This is the basic idea. Even though it isn't useful for ILIKE queries by itself, you can do more with it (depending on your needs).

For example: if you only need case-insensitive matching, it would be sufficient to do:

-- Create indexes over the lowered value (drop the indexes above first, or use different names, to avoid a name clash):
create index pickfoo1 on foo (lower(bar #>> '{col1}'));
create index pickfoo2 on foo (lower(bar #>> '{col2}'));

-- Check that it matches:
select * from foo where lower(bar #>> '{col1}') = lower('soMEvaLUe');

NOTE: This is only an example: if you run EXPLAIN on the previous select, you will see that postgres actually performs a sequential scan instead of using the index. But that is because we are testing on a table with a single row, which is not the usual case. I'm sure you can verify it with a bigger table ;-)

With huge tables, even LIKE queries should benefit from the index if the first wildcard doesn't appear at the beginning of the string (but that isn't specific to jsonb; it's how btree indexes work in general).
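
For example, a left-anchored pattern such as the one below could use the btree index above (as an extra caveat not in the original answer: with a non-C collation you may need to build the index with the text_pattern_ops operator class for this to happen):

select * from foo where bar #>> '{col1}' like 'some%';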

If you need to optimize queries like:

select * from foo where bar #>> '{col1}' ilike '%MEvaL%';

...then you should consider using GIN or GIST indexes instead.
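
The answer doesn't spell this out, but one common way to handle such a query is a trigram GIN index from the pg_trgm extension over the same expression; a rough sketch (the index name is arbitrary):

create extension if not exists pg_trgm;
create index pickfoo1_trgm on foo using gin ((bar #>> '{col1}') gin_trgm_ops);

-- the ilike query above can now use the trigram index
select * from foo where bar #>> '{col1}' ilike '%MEvaL%';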

bitifet
  • Thanks for your answer. Unfortunately, I'm afraid you misunderstood the point of this question. I'm asking for a more efficient way to search through the ENTIRE jsonb at the first level than unpacking it. By entire I mean every key. This is for a search engine. I can't even say which keys a particular column has. – Kamil Gosciminski Aug 26 '16 at 17:53
  • That's why I began by saying "If you know that…". If you need to be able to search by any (not pre-established) key, it is much harder to optimize. – bitifet Aug 26 '16 at 21:51
  • Also, I dare say that trying to parse JSON "in a more efficient way" will not solve your problem. First of all, because jsonb isn't internally stored as json PRECISELY so that it can be accessed more efficiently (I don't think you could do that faster than the native access functions). – bitifet Aug 26 '16 at 21:56
  • …but MOST IMPORTANTLY because when you are searching for anything in a database, no matter which datatype you are searching, and especially when the amount of data becomes huge, the point is not to parse or compare it faster, but to NOT DO IT AT ALL. That is the goal of indexes. Optimizing parsing and/or comparison will, at most, delay the actual problem a bit (supposing your database grows over time, as most do). – bitifet Aug 26 '16 at 22:04
  • …so, I insist: even if my simple example doesn't fit your needs, no doubt you need to approach this using indexes. It won't be trivial, of course. And I dare say it would be impossible with a (default) btree. But I suggest you investigate (quite deeply, in fact) GIN and GIST indexes. I'm not expert enough to write out a feasible example easily, but I'm sure you could implement one that, at least, discards (WITHOUT A SCAN) rows not having the requested keys, and surely filters by ones having certain chunks (trigrams, for example), making the final seq scans much shorter. – bitifet Aug 26 '16 at 22:14
  • NOTE: not an actual solution, but a reasonable patch meanwhile: remember that postgres can index not only one (or more) column values but also complex expressions, INCLUDING FUNCTION CALLS (as long as the function is immutable), so you could implement, for example, a function returning a sorted list of the json keys. Again, a simple btree won't be enough, but with trgm (not to mention implementing a more specific gin/gist one) it could be possible to first filter the rows having the keys we want to look for (a rough sketch follows these comments). – bitifet Aug 26 '16 at 22:25
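
A rough sketch of that last idea, assuming an immutable helper function plus a trigram index is acceptable (all names below are hypothetical):

-- immutable helper returning the sorted first-level keys as a single text value
create or replace function jsonb_keys_text(j jsonb) returns text
  language sql immutable
  as $$ select string_agg(k, ' ' order by k) from jsonb_object_keys(j) as k $$;

-- trigram index over the function result, so rows can be pre-filtered by key name
create extension if not exists pg_trgm;
create index foo_keys_trgm on foo using gin (jsonb_keys_text(bar) gin_trgm_ops);

-- narrow down to rows that actually contain a 'col4' key before any further per-row work
select * from foo where jsonb_keys_text(bar) like '%col4%';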