
I've got a Postgres 9.4.4 database with 1.7 million records; each record stores the following information in a JSONB column called data in a table called accounts:

data: {
  "lastUpdated": "2016-12-26T12:09:43.901Z",
  "lastUpdatedTimestamp": "1482754183"
}

The actual JSONB column stores much more information, but I've omitted the irrelevant data. The data format cannot be changed since this is legacy information.

I'm trying to efficiently obtain a count of all records where the lastUpdated value is greater than or equal to some reference time (I'll use 2015-12-01T10:10:10Z in the following examples):

explain analyze SELECT count(*) FROM "accounts" 
WHERE data->>'lastUpdated' >= '2015-12-01T10:10:10Z';

This takes over 22 seconds:

Aggregate  (cost=843795.05..843795.06 rows=1 width=0) (actual time=22292.584..22292.584 rows=1 loops=1)
   ->  Seq Scan on accounts  (cost=0.00..842317.05 rows=591201 width=0)
       (actual time=1.410..22142.046 rows=1773603 loops=1)
         Filter: ((data ->> 'lastUpdated'::text) >= '2015-12-01T10:10:10Z'::text)
 Planning time: 1.234 ms
 Execution time: 22292.671 ms

I've tried adding the following text index:

CREATE INDEX accounts_last_updated ON accounts ((data->>'lastUpdated'));

But the query is still rather slow, at over 17 seconds:

Aggregate  (cost=815548.64..815548.65 rows=1 width=0) (actual time=17172.844..17172.845 rows=1 loops=1)
  ->  Bitmap Heap Scan on accounts  (cost=18942.24..814070.64 rows=591201 width=0)
      (actual time=1605.454..17036.081 rows=1773603 loops=1)
        Recheck Cond: ((data ->> 'lastUpdated'::text) >= '2015-12-01T10:10:10Z'::text)
        Heap Blocks: exact=28955 lossy=397518
        ->  Bitmap Index Scan on accounts_last_updated  (cost=0.00..18794.44 rows=591201 width=0)
            (actual time=1596.645..1596.645 rows=1773603 loops=1)
              Index Cond: ((data ->> 'lastUpdated'::text) >= '2015-12-01T10:10:10Z'::text)
Planning time: 1.373 ms
Execution time: 17172.974 ms

I've also tried following the directions in Create timestamp index from JSON on PostgreSQL, creating the following function and index:

CREATE OR REPLACE FUNCTION text_to_timestamp(text)
RETURNS timestamp AS
$$SELECT to_timestamp($1, 'YYYY-MM-DD HH24:MI:SS.MS')::timestamp; $$
LANGUAGE sql IMMUTABLE;

CREATE INDEX accounts_last_updated_ts ON accounts
(text_to_timestamp(data->>'lastUpdated'));

But this doesn't give me any improvement; in fact, it was slower, taking over 24 seconds for the query versus 22 seconds for the unindexed version:

explain analyze SELECT count(*) FROM "accounts" 
WHERE text_to_timestamp(data->>'lastUpdated') >= '2015-12-01T10:10:10Z';

Aggregate  (cost=1287195.80..1287195.81 rows=1 width=0) (actual time=24143.150..24143.150 rows=1 loops=1)
  ->  Seq Scan on accounts  (cost=0.00..1285717.79 rows=591201 width=0)
      (actual time=4.044..23971.723 rows=1773603 loops=1)
        Filter: (text_to_timestamp((data ->> 'lastUpdated'::text)) >= '2015-12-01 10:10:10'::timestamp without time zone)
Planning time: 1.107 ms
Execution time: 24143.183 ms

In one last act of desperation, I decided to add another timestamp column and update it to contain the same values as data->>'lastUpdated':

alter table accounts add column updated_at timestamp;
update accounts set updated_at = text_to_timestamp(data->>'lastUpdated');
create index accounts_updated_at on accounts(updated_at);

This has given me by far the best performance:

explain analyze SELECT count(*) FROM "accounts" where updated_at >= '2015-12-01T10:10:10Z';

Aggregate  (cost=54936.49..54936.50 rows=1 width=0) (actual time=676.955..676.955 rows=1 loops=1)
  ->  Index Only Scan using accounts_updated_at on accounts
      (cost=0.43..50502.48 rows=1773603 width=0) (actual time=0.026..552.442 rows=1773603 loops=1)
        Index Cond: (updated_at >= '2015-12-01 10:10:10'::timestamp without time zone)
        Heap Fetches: 0
Planning time: 4.643 ms
Execution time: 678.962 ms

However, I'd very much like to avoid adding another column just to improve the speed of this one query.

This leaves me with the following question: is there any way to improve the performance of my JSONB query so it can be as efficient as the individual-column query (the last query, where I used updated_at instead of data->>'lastUpdated')? As it stands, it takes 17 to 24 seconds to query the JSONB data using data->>'lastUpdated', while it takes only 678 ms to query the updated_at column. It doesn't make sense that the JSONB query would be so much slower. I was hoping the text_to_timestamp function would improve the performance, but that hasn't been the case (or I'm doing something wrong).

  • In order for the query to use the index on the JSON column, your query **has** to use exactly the same expression as the index, so you should use `where text_to_timestamp(data->>'lastUpdated') > ...` – Mar 23 '17 at 06:47
  • I did actually use the same query as you suggested, I just didn't include it in my original question by mistake. I've updated the question to include the exact query I used in the analyze command that yielded an Execution time of 24143.183 ms. Is there something else that I'm missing? – adamc Mar 23 '17 at 11:35

1 Answer


In your first and second attempts, most of the execution time is spent on the index recheck or filter, which must read the JSON field for every row the index hits, and reading JSON is expensive. If the index hits a couple hundred rows, the query will be fast; but when it hits thousands or hundreds of thousands of rows, filtering/rechecking the JSON field takes serious time. In the second attempt, funneling the value through an additional function makes it even worse. A JSON field is good for storing data, but it is not intended for analytic queries like summaries and statistics, and it's bad practice to use JSON objects in WHERE conditions, at least as the main filtering condition, as in your case. That last act of desperation of yours is the right way to go :)
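
As an aside, the `Heap Blocks: exact=28955 lossy=397518` line in your second plan means the bitmap overflowed work_mem and went lossy, so the recheck had to re-evaluate the JSONB expression for every row on those pages. A quick session-local experiment, assuming work_mem is near its default (the value below is purely illustrative):

-- Give the bitmap enough memory to stay exact instead of lossy,
-- which avoids rechecking the JSONB expression row by row.
SET work_mem = '256MB';

explain analyze SELECT count(*) FROM "accounts"
WHERE data->>'lastUpdated' >= '2015-12-01T10:10:10Z';

-- Restore the session default afterwards.
RESET work_mem;

Even with an exact bitmap, though, your predicate matches essentially all of the table (rows=1773603 in every plan), so any plan still has to visit most rows and no index can make that particular count cheap.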

To improve query performance, you must add one or a few columns holding the key values that are used most often in WHERE conditions.
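
If writes go through the database, a trigger can keep such a column in sync with the JSONB automatically so it never drifts. A minimal sketch, reusing the text_to_timestamp function from the question (the trigger and function names here are illustrative, not from the original post):

CREATE OR REPLACE FUNCTION accounts_sync_updated_at()
RETURNS trigger AS $$
BEGIN
  -- Mirror the JSONB field into the dedicated, indexable column.
  NEW.updated_at := text_to_timestamp(NEW.data->>'lastUpdated');
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER accounts_updated_at_sync
BEFORE INSERT OR UPDATE OF data ON accounts
FOR EACH ROW EXECUTE PROCEDURE accounts_sync_updated_at();

After the one-time backfill you already did, inserts and updates keep updated_at current without any application changes.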

LongBeard_Boldy
  • Thanks for the response. Marking this as the correct answer because no other answers have been provided, and I haven't figured out a solution. In the end, I've changed our API to use cursored pagination and to not return the count from this endpoint and the client agreed to this, so I was able to avoid the performance penalty. Not ideal, but it was pragmatic – adamc Mar 31 '17 at 00:49