26

I'm experimenting with keeping values like the following in a Postgres jsonb field in Postgres 9.4:

[{"event_slug":"test_1","start_time":"2014-10-08","end_time":"2014-10-12"},
 {"event_slug":"test_2","start_time":"2013-06-24","end_time":"2013-07-02"},
 {"event_slug":"test_3","start_time":"2014-03-26","end_time":"2014-03-30"}]

I'm executing queries like:

SELECT * FROM locations
WHERE EXISTS (
  SELECT 1 FROM jsonb_array_elements(events) AS e
  WHERE (
    e->>'event_slug' = 'test_1' AND
    (
      e->>'start_time' >= '2014-10-30 14:04:06 -0400' OR
      e->>'end_time' >= '2014-10-30 14:04:06 -0400'
    )
  )
)

How would I create an index on that data for queries like the above to utilize? Does this sound reasonable design for a few million rows that each contain ~10 events in that column?

Worth noting that it seems I'm still getting sequential scans with:

CREATE INDEX events_gin_idx ON some_table USING GIN (events);

which I'm guessing is because the first thing I'm doing in the query is converting data to json array elements.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
Tony
  • 18,776
  • 31
  • 129
  • 193
  • In the query you have `e` as column name, in the index we see `events`. Please add a table definition (`CREATE TABLE` script) to avoid confusion. And your version of Postgres. You tagged `jsonb`, but speak of "Postgres json". Again, the table definition would clarify. – Erwin Brandstetter Dec 30 '14 at 11:53
  • @ErwinBrandstetter Sorry for the confusion, I updated the question with a query that makes a bit more sense. "events" is a jsonb column in the location table. All clear now? – Tony Dec 30 '14 at 16:56

2 Answers2

52

First of all, you cannot access JSON array values like that. For a given json value:

[{"event_slug":"test_1","start_time":"2014-10-08","end_time":"2014-10-12"},
 {"event_slug":"test_2","start_time":"2013-06-24","end_time":"2013-07-02"},
 {"event_slug":"test_3","start_time":"2014-03-26","end_time":"2014-03-30"}]

A valid test against the first array element would be:

WHERE e->0->>'event_slug' = 'test_1'

But you probably don't want to limit your search to the first element of the array. With the jsonb data type you have additional operators and index support.

At the time of asking, there was no built-in "greater than" or "less than" operator for jsonb columns. This changed with added SQL/JSON path functionality in Postgres 12.

You can choose between two operator classes for your GIN index. The manual:

jsonb_ops
@> (jsonb,jsonb)
@? (jsonb,jsonpath)
@@ (jsonb,jsonpath)
? (jsonb,text)
?| (jsonb,text[])
?& (jsonb,text[])

jsonb_path_ops
@> (jsonb,jsonb)
@? (jsonb,jsonpath)
@@ (jsonb,jsonpath)

(jsonb_ops being the default.) You can cover the equality test, but your requirement for >= comparison is only met with a jsonpath operator. (You need a btree index in older versions.)

CREATE INDEX locations_events_gin_idx ON locations
USING gin (events jsonb_path_ops);

Basic solution

Postgres 12 or later

SELECT l.*
FROM   locations l
WHERE  l.events @? '$[*] ? (@.event_slug == "test_1")
                         ? (@.end_time.datetime() < "2014-10-13".datetime()' 

Or, if you really need to "OR" two filters (see below):

SELECT l.*
FROM   locations l
WHERE  l.events @? '$[*] ? (@.event_slug == "test_1")
                         ? (@.start_time.datetime() < "2014-10-13".datetime()  || @.end_time.datetime() < "2014-10-13".datetime())' 

This is much simpler now than my original answer for older versions.

Any Postgres version

SELECT * FROM locations WHERE events @> '[{"event_slug":"test_1"}]';

This might be good enough if the filter is selective enough.
Assuming end_time >= start_time, so we don't need two checks. Checking only end_time is cheaper and equivalent:

SELECT l.*
FROM   locations l
     , jsonb_array_elements(l.events) e
WHERE  l.events @> '[{"event_slug":"test_1"}]'
AND   (e->>'end_time')::timestamp >= '2014-10-30 14:04:06'::timestamptz;

Related:

Utilizing an implicit JOIN LATERAL. Details (last chapter):

Careful with the different data types! What you have in the JSON value looks like timestamp [without time zone], while your predicates use timestamp with time zone literals. The timestamp value is interpreted according to the current time zone setting, while the given timestamptz literals must be cast to timestamptz explicitly or the time zone would be ignored! Above query should work as desired. Detailed explanation:

More explanation for jsonb_array_elements():

Advanced solution

If the above is not good enough, I would consider a MATERIALIZED VIEW that stores relevant attributes in normalized form. This allows plain btree indexes.

The code assumes that your JSON values have a consistent format as displayed in the question.

Setup:

CREATE TYPE event_type AS (
 , event_slug  text
 , start_time  timestamp
 , end_time    timestamp
);

CREATE MATERIALIZED VIEW loc_event AS
SELECT l.location_id, e.event_slug, e.end_time  -- start_time not needed
FROM   locations l, jsonb_populate_recordset(null::event_type, l.events) e;

Related answer for jsonb_populate_recordset():

CREATE INDEX loc_event_idx ON loc_event (event_slug, end_time, location_id);

Also including location_id to allow index-only scans. (See manual page and Postgres Wiki.)

Query:

SELECT *
FROM   loc_event
WHERE  event_slug = 'test_1'
AND    end_time  >= '2014-10-30 14:04:06 -0400'::timestamptz;

Or, if you need full rows from the underlying locations table:

SELECT l.*
FROM  (
   SELECT DISTINCT location_id
   FROM   loc_event
   WHERE  event_slug = 'test_1'
   AND    end_time  >= '2014-10-30 14:04:06 -0400'::timestamptz
   ) le
JOIN locations l USING (location_id);
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • This makes sense so yea ....the tricky part is the comparison. I'm fine with a larger index but it's also possible I'll get enough performance by indexing just the event_slug. However, I need to match those embedded hashes on multiple conditions so I only want to return items where the event slug is the one being queried for AND that particular event has the start time and end time requirements. Hopefully my new query example clarifies this. – Tony Dec 30 '14 at 17:00
  • @Tony: Added more, simplified and fixed a bug. – Erwin Brandstetter Jan 03 '15 at 04:16
  • Thanks for the detailed explanation. My hunch is it will be best to just normalize and not store this as jsonb (sorta like your materialized view recommendation although maybe just forget the view and move the data to an actual table) ...but I'll see how the performance pans out. The only reason I have to check for both start time and end time is either could be null. Also thanks for the extra effort regarding timestamps. I actually only need ~1 month resolution on this data so wasn't worried about it but very good to know. – Tony Jan 03 '15 at 18:39
  • @Tony: If `start_time` and `end_time` can be NULL I suggest to use `COALESCE(e.end_time, e.start_time) AS last_time` in the MV. Storing the data in normalized form is certainly a very good option. My answer is based on the assumption you would *need* JSON for some reason. – Erwin Brandstetter Jan 03 '15 at 20:01
  • This answer provides me a good starting point for my data schema evolution: I want to store json, and query it too. I can start with postgresql jsonb, and transition to materialized views of that json data, and finally a normalized table of that json data. Thanks @ErwinBrandstetter – Kiran Subbaraman Feb 04 '15 at 06:47
  • @ErwinBrandstetter: I am getting an erorr with the Basic solution: ERROR: invalid input syntax for type timestamp: "end_time" LINE 5: AND e->>'end_time'::timestamp >= '2014-10-30 14:04:06 -04... –  May 15 '15 at 03:47
  • @kristen: Sorry, parentheses were missing. The type cast operator `::` precedes the json operator `->>`. Also: what you have there looks like a `timestamp with time zone` (you cut off the rest, so I can't be sure). You probably want to cast that to `timestamptz` or you lose the time zone offset. Read details here: http://stackoverflow.com/questions/9571392/ignoring-timezones-altogether-in-rails-and-postgresql/9576170#9576170 – Erwin Brandstetter May 15 '15 at 11:17
-1
CREATE INDEX json_array_elements_index ON
    json_array_elements ((events_arr->>'event_slug'));

Should get you started in the right direction.

Bhargav Rao
  • 50,140
  • 28
  • 121
  • 140
Momer
  • 3,158
  • 22
  • 23