1

Having an RDS Postgresql 12.5 and working on a table (app_events) with a JSONB column (metadata), the JSON data may vary based on the event name, see the structure below.

CREATE TABLE IF NOT EXISTS "public".app_events (
    id uuid DEFAULT uuid_generate_v4() NOT NULL,
    event_id text NOT NULL,
    name text NOT NULL,
    creation_time timestamp without time zone NOT NULL,
    creation_time_in_milliseconds bigint NOT NULL,
    metadata jsonb NOT NULL,
    PRIMARY KEY(id)
);

There are default indexes (btree) on event_id and name.

Based on the name column we create views to normalize the data from JSON into tabular format, an example is below.

CREATE OR REPLACE VIEW "public".charity_created AS
 SELECT app_events.id,
    app_events.event_id,
    app_events.name,
    app_events.creation_time,
    date(app_events.creation_time) AS created_date,
    (app_events.metadata ->> 'aggregateId'::text) AS user_id,
    (app_events.metadata ->> 'url'::text) AS url,
    (app_events.metadata ->> 'name'::text) AS charity_name,
    (app_events.metadata ->> 'about'::text) AS charity_about,
    (app_events.metadata ->> 'country'::text) AS country,
    (app_events.metadata ->> 'category'::text) AS category,
    (app_events.metadata ->> 'currencyCode'::text) AS currencycode,
    (app_events.metadata ->> 'isPayItForwardPartner'::text) AS is_pay_it_forward_partner,
    (app_events.metadata ->> 'isCampaignDonationPartner'::text) AS is_campaign_donation_partner
   FROM public.app_events
  WHERE (app_events.name = 'CharityCreated'::text)
  ORDER BY (date(app_events.creation_time)) DESC;

And now you can run queries like the below.

SELECT * 
FROM "public".charity_created
WHERE charity_name == 'some_charity_name'

In addition, we create joins or unions between views and started noticing latency on reads/queries up to an hour sometimes, no timeouts but returning data can be challenging, definitely a big hit in our reporting team.

The question and knowledge I am looking for are, where can I create (or should create) indexes to improve the read latency; Debating with two findings so far.

  1. On the actual JSONB column (metadata) doesn't make sense, so maybe a solution is to start creating indexes based on the specific schema per event name like this answer
  2. Create materialized views and pay the price to refresh the copy of data (every night) while creating indexes on the materialized views, similar to this answer
  3. Maybe keep the views and create indexes on these
George Taskos
  • 8,324
  • 18
  • 82
  • 147

1 Answers1

1

You can try to use indexes with expressions like:

CREATE INDEX app_events_charity_name_idx ON app_events USING(metadata ->> 'name'::text);

This index might be used when the expression is used (exactly) in the query, e.g.:

SELECT * FROM app_events WHERE metadata ->> 'name'::text = 'some charity';

This should work with your view as well. But you should check that with EXPLAIN.

In newer versions of Postgres there is also the possibility of generated columns. Compared to materialized views, these have the advantage that you do not have to worry about updating them.

ALTER TABLE app_events ADD charity_name TEXT GENERATED ALWAYS AS (metadata ->> 'name'::text);

You can create indexes for these columns as well.

clemens
  • 16,716
  • 11
  • 50
  • 65
  • The first option might be the way to go. We do query the view column though which pretty much is the expression indexed AS column_name, do you think this should work? I don't get the EXPLAIN part. – George Taskos Mar 28 '21 at 14:51
  • I guess I should add NOT NULL on create index; because the JSON schema is always different per event name. – George Taskos Mar 28 '21 at 15:33
  • You should definitely investigate it. Sometimes the optimiser can play unexpected surprises. How are you going to specify NOT NULL in an index definition? I also don't understand how this relates to the schema. – clemens Mar 28 '21 at 17:24
  • I saw that here https://stackoverflow.com/a/36076895/122769. CREATE INDEX animal_index ON farm (((animal ->> 'cow')::int)) WHERE (animal ->> 'cow') IS NOT NULL; – George Taskos Mar 28 '21 at 19:07
  • 1
    This is a partial index. It makes sense if the number of NULLs is large but in general you must include that condition into your queries so that the index is used. – clemens Mar 29 '21 at 07:22