Indexing a JSONB key in combination with other columns

Question

For searching on specific keys within a jsonb column, I wanted to create an index on the column.

Using: Postgres 10.2

Ignoring some unrelated columns, I have the table animals with these columns (omitting some unrelated columns):

animalid PK number
location (text)
type (text)
name (text)
data (jsonb) for eg: {"age": 2, "tagid": 11 }

I would need to search based on: location, type and tagId. Like:

where location = ? and type = 'cat' and (data ->> 'tagid') = ?

Other important points:

Only animals of type cats will be having a tag id, this is a new animal type being added now.
The number of "cats" in the whole table would be less compared to the other types of animals.
The table is huge, with millions of rows - and partitioned.

How do I make sure that the search is fast? Options I have considered:

Make a separate table cats to store: animal_id, location, tagId (although FK to the partitioned parent table is not possible)
Create an index on location, type and jsonb key.
Create a new (indexed) column tagId - which would be null for all animals but cat.

I do have an index on other columns on the table - but a bit confused on how to create an index to make the search for cats based on the tagid fast. Any suggestions?

UPDATE (Ignoring partitions):

(Testing on a partitioned table)

So I decided to go with the option as suggested by Erwin and tried creating an index

CREATE INDEX ON animals_211 (location, ((data->>'tagid')::uuid)) WHERE  type = 'cat';

And tried an EXPLAIN on the query (using the partitioned table to keep it simple):

explain select * from animals_211 a
where a.location  = 32341
and a.type  = 'cat'
and (data->>'tagid')::uuid = '5e54c1d9-3ea0-4bca-81d6-1000d90cc42c'

And from the results, it looks like it doesn't use the index created and does a sequential scan:

Seq Scan on animals_211  e  (cost=0.00..121.70 rows=1 width=327)                                                                                                        |
  Filter: ((location = 32341) AND ((type)::text = 'cat'::text) AND (((data ->> 'tagid'::text))::uuid = '5e54c1d9-3ea0-4bca-81d6-1000d90cc42c'::uuid

UPDATE 2 (not using partial index)

It somehow seems to be the partial index, as without it - it seems to work:

CREATE INDEX tag_id_index ON animals_211 (location, type, ((data->>'tagid')::uuid))

When I do an explain plan:

Index Scan using tag_id_index on animals_211 e  (cost=0.28..8.30 rows=1 width=327)                                                                                         
  Index Cond: ((location = 32341) AND ((type)::text = 'cat'::text) AND (((data ->> 'tagid'::text))::uuid = '5e54c1d9-3ea0-4bca-81d6-1000d90cc42c'::uuid))

Your update is confusing. The partial index is on the parent table, but Postgres 10 does not support indexes on the parent table. You should see an error msg for that. I added some more to my answer around the topic. I suggest to split off the question for the best index from the question for the best partitioning ... — Erwin Brandstetter, Feb 01 '21 at 18:15
@ErwinBrandstetter am able to create the index, no error messages. But it didn't show up on the table's index list. I agree, let's take partition out of the equation (I would need to do this manually for all partitions) - I tried to create the index specifically on one partition. Index got created, but the explain plan shows it still doesn't use the index. I tried only the location+type just to see and that works. So somehow it seems to be an issue with the jsonb index? — Chillax, Feb 02 '21 at 14:42
The partial index should definitely work. Maybe you need to run `ANALYZE`. Consider the notes & links I added to my answer. — Erwin Brandstetter, Feb 03 '21 at 03:30

Erwin Brandstetter · Accepted Answer · 2021-04-08T15:36:13.137

Basics (ignoring partitioning)

Based on your three "important points" I suggest a partial index on an expression:

CREATE INDEX ON animals ((data->>'tagid'))
WHERE  type = 'cat';

Use CREATE INDEX CONCURRENTLY ... to avoid locking issues with concurrent write access to the same table.

Postgres also gathers specific statistics for the partial index, which help the query planner to get appropriate estimates. Note that you need to run ANALYZE (or VACUUM ANALYZE) manually if you test the index right after creation before autovacuum can kick in. See:

If tagid is really some other data type than text, you can also cast the expression to optimize some more. See:

PostgreSQL Index on JSON

Your update suggests tagid stores UUID values. Read:

So consider this index instead:

CREATE INDEX ON animals (((data->>'tagid')::uuid))  -- !
WHERE  type = 'cat';

The extra set of parentheses around (data->>'tagid')::uuid is required to make the syntax unambiguous.
And a matching query:

SELECT *
FROM   animals
WHERE  location = 32341
AND    type = 'cats'
AND    (data->>'tagid')::uuid = '5e54c1d9-3ea0-4bca-81d6-1000d90cc42c';  -- !

Or - depending on the selectivity of each predicate and which variants of the query are possible - include location to make it a multicolumn index:

CREATE INDEX ON animals (location, ((data->>'tagid')::uuid))
WHERE  type = 'cat';

Or tagid first if you have queries without filtering on location. See:

Is a composite index also good for queries on the first field?

Since only relatively few rows are of type 'cat', the index will be relatively small, excluding the bulk of your "millions of rows". And we only ever need the index on tagid for cats to begin with. Win-win.

If possible break out the json key data->>'tagid' as dedicated column. (Like you considered as option 3.) Can be null where it's not applicable, null storage is very cheap. Makes storage and index cheaper, yet, and queries a bit simpler.

Does not using NULL in PostgreSQL still use a NULL bitmap in the header?

Partitioning

Postgres 10 does not support indexes on the parent table of a partitioned table. This was added in Postgres 11. Declarative partitioning has been improved a lot since. Consider upgrading to the current version 13 or later.

There is also the option of "old-style" partitioning with inheritance. Then you can have a separate partition for cats with an additional column tagid only there. The manual:

For declarative partitioning, partitions must have exactly the same set of columns as the partitioned table, whereas with table inheritance, child tables may have extra columns not present in the parent.

Sounds like a perfect fit. But inheritance has fallen out of favor with Postgres, so I would think twice before doing that.

Either way - be it declarative or inheritance - if you have all "cats" in a separate partition a non-partial index does the job, obviously:

CREATE INDEX ON cats (location, ((data->>'tagid')::uuid));

And the query can target the partition cats instead of the parent table:

SELECT *
FROM   cats
WHERE  location = 32341
AND    (data->>'tagid')::uuid = '5e54c1d9-3ea0-4bca-81d6-1000d90cc42c';

Targeting the parent table should work, too. (Not sure about Postgres 10.)

SELECT *
FROM   animals
WHERE  type = 'cat'
AND    location = 32341
AND    (data->>'tagid')::uuid = '5e54c1d9-3ea0-4bca-81d6-1000d90cc42c';

But activate partition pruning for that. The manual:

Note that partition pruning is driven only by the constraints defined implicitly by the partition keys, not by the presence of indexes. Therefore it isn't necessary to define indexes on the key columns.

All other partitions should be pruned, and then you should get an index scan on the cats partition only ...

Just updated my question, I did as you suggested. But doesn't seem to use the index. Anywhere I might be going wrong? — Chillax, Feb 01 '21 at 16:39
Upgrading Postgresql is not an option atm (although I understand the importance\relevance of it) and we do not want to create a separate column for this as we want the columns to be for common fields across animals and do not want a separate column for cat\new animals in the future — Chillax, Feb 02 '21 at 14:41
Just for the record: If more than a few percent are cats, it would typically still be efficient to have a dedicated column `tagid`, which is NULL for non-cats. Optionally enforced with a `CHECK` constraint. NULL storage is very cheap. Follow the link above. — Erwin Brandstetter, Feb 03 '21 at 03:18
I have managed to successfully create the indexers, but had to do them with lots of retries - as it was throwing deadlock errors (these are live tables with writes occurring at the same time). I am thinking of using CONCURRENTLY while creating index - could it help in preventing the deadlocks? (as in the documentation I could see that it would it wouldn't lock the table from writes, and will wait for the current transactions to complete - also allowing transactions during the index creation) — Chillax, Apr 08 '21 at 15:23
@Chillax: Yes, that's exactly what `CONCURRENTLY` is for. I added a link above, too. — Erwin Brandstetter, Apr 08 '21 at 15:37

Indexing a JSONB key in combination with other columns

1 Answers1

Basics (ignoring partitioning)

Partitioning