1

I'm trying to index a large JSONB column based on a text field (with an ISO date string). This index works fine using = but is ignored if I use a > condition.

create table test_table (  
  id text NOT null primary key,
  data jsonb,
  text_test text
);

Then I add a bunch data to the jsonb column. And to ensure my JSON is valid, extract/copy the value I'm interested in from the JSONB column into another text column to test against too.

update test_table set text_test = (data->>'dueDate');

A quick sample shows it's good ISO formatted date strings:

select text_test, (data->>'dueDate') from test_table limit 1;
-- 2020-08-07T11:59:00  2020-08-07T11:59:00

I add btree indexes to both the JSONB and the text_test copy column. I tried adding one with explicit '::text' casting, as well as one with 'text_pattern_ops'.

create index test_table_duedate_iso on test_table using btree(text_test);
create index test_table_duedate_iso_jsonb on test_table using btree((data->>'dueDate'));
create index test_table_duedate_iso_jsonb_cast on test_table using btree(((data->>'dueDate')::text));
create index test_table_duedate_iso_jsonb_cast_pattern on test_table using btree(((data->>'dueDate')::text) text_pattern_ops);

Now if I query an exact value, explain shows it using the 'cast' version of the index. Good.

explain select * from test_table where (data->>'dueDate') = '2020-08-07T11:59:00';
"->  Bitmap Index Scan on test_table_duedate_iso_jsonb_cast  (cost=0.00..10.37 rows=261 width=0)"

But if I try it with a >, it does a very slow full scan.

explain analyze select count(*) from test_table where (data->>'dueDate') > '2020-04-14';
--Aggregate  (cost=10037.94..10037.95 rows=1 width=8) (actual time=1070.808..1070.813 rows=1 loops=1)
--  ->  Seq Scan on test_table  (cost=0.00..9994.42 rows=17409 width=0) (actual time=0.069..1057.258 rows=2930 loops=1)
--        Filter: ((data ->> 'dueDate'::text) > '2020-04-14'::text)
--        Rows Removed by Filter: 49298
--Planning Time: 0.252 ms
--Execution Time: 1070.874 ms

So just to check my sanity, I do the same query against the text_test column, it uses it's index as desired:

explain analyze select count(*) from test_table where text_test > '2020-04-14';
--Aggregate  (cost=6037.02..6037.03 rows=1 width=8) (actual time=19.979..19.984 rows=1 loops=1)
--  ->  Bitmap Heap Scan on test_table  (cost=77.76..6030.14 rows=2754 width=0) (actual time=1.354..11.007 rows=2930 loops=1)
--        Recheck Cond: (text_test > '2020-04-14'::text)
--        Heap Blocks: exact=455
--        ->  Bitmap Index Scan on test_table_duedate_iso  (cost=0.00..77.07 rows=2754 width=0) (actual time=1.215..1.217 rows=2930 loops=1)
--              Index Cond: (text_test > '2020-04-14'::text)
--Planning Time: 0.145 ms
--Execution Time: 20.041 ms

I have also tested indexing a numerical field within the JSON and it actually works properly, using it's index for ranged type queries. So it's something about the text field or something I'm doing wrong with it.

PostgreSQL 11.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.1.2 20070626 (Red Hat 4.1.2-14), 64-bit

user184190
  • 107
  • 1
  • 8
  • 1
    It may have something to do with the fact that Postgres does not maintain statistics on jsonb values. It just assumes a flat 1% selectivity. Try turning bitmap scan off just to test this theory. set enable_bitmapscan = off; – ccleve Jan 07 '21 at 00:32
  • Why are you using BTREE instead of GIN? BTREE is hardly useful for jsonb. Another issue is that you're now trying to compare a text field (data->>'dueDate') with a date. Less than and greater than don't make a lot of sense when using text fields. – Frank Heikens Jan 07 '21 at 08:59
  • @jjanes Sorry I mixed up the test cases I included when creating the question (< vs. >). I've updated it to be consistent and added further explain analyze details. – user184190 Jan 07 '21 at 14:12
  • @FrankHeikens As I understand it, I'm effectively creating a functional index on a single JSONB attribute. And since I want to do a range search on it, a btree is well suited. I thought gin was more suited to lookups and composite searches, which is not my need. And I'm not using dates anywhere, this is all text, due to JSON limitations and to_date index limitations based on being mutable. – user184190 Jan 07 '21 at 14:20
  • @ccleve Following up on your idea I turned 'enable_seqscan' off and now it's using my index and taking 20ms. Can I make that happen without that setting, which seems unwise for production? – user184190 Jan 07 '21 at 14:23
  • @user184190 Forcing the use of an index is difficult. See https://stackoverflow.com/questions/309786/how-do-i-force-postgres-to-use-a-particular-index If your app allows it, you might consider extracting dueDate out to a regular column and putting a btree on that. – ccleve Jan 07 '21 at 21:35
  • 1
    Thanks for correcting the direction of the comparison. The actual row counts are the same, but the expected ones are quite different. Did you ANALYZE the table after creating the functional index? Functional indexes have their own statistics, but they aren't collected until the table gets analyzed. And just creating a functional index will not trigger an auto-analyze to take place. – jjanes Jan 07 '21 at 22:56

0 Answers0