
I have a large table crumbs (about 100M+ rows, 100 GB). It's just a collection of JSON documents stored as text. The table has an index on the column run_id, which has about 10K unique values, so each run is small (1K - 1M rows).

For a simple query:

explain analyze verbose select * from crumbs c 
where c.run_id='2016-04-26T19_02_01_015Z' limit 10

The plan is good:

Limit  (cost=0.56..36.89 rows=10 width=2262) (actual time=1.978..2.016 rows=10 loops=1)
  Output: id, robot_id, run_id, content, created_at, updated_at, table_id, fork_id, log, err
  ->  Index Scan using index_crumbs_on_run_id on public.crumbs c  (cost=0.56..5533685.73 rows=1523397 width=2262) (actual time=1.975..1.996 rows=10 loops=1)
        Output: id, robot_id, run_id, content, created_at, updated_at, table_id, fork_id, log, err
        Index Cond: ((c.run_id)::text = '2016-04-26T19_02_01_015Z'::text)
Planning time: 0.117 ms
Execution time: 2.048 ms

But if I try to look inside the JSON stored in one of the columns, the planner wants to do a full scan:

explain verbose select x from crumbs c, 
lateral json_array_elements(c.content::json) x
where c.run_id='2016-04-26T19_02_01_015Z' 
limit 10

Plan:

Limit  (cost=0.01..0.69 rows=10 width=32)
  Output: x.value
  ->  Nested Loop  (cost=0.01..10332878.67 rows=152343800 width=32)
        Output: x.value
        ->  Seq Scan on public.crumbs c  (cost=0.00..7286002.66 rows=1523438 width=895)
              Output: c.id, c.robot_id, c.run_id, c.content, c.created_at, c.updated_at, c.table_id, c.fork_id, c.log, c.err
              Filter: ((c.run_id)::text = '2016-04-26T19_02_01_015Z'::text)
        ->  Function Scan on pg_catalog.json_array_elements x  (cost=0.01..1.01 rows=100 width=32)
              Output: x.value
              Function Call: json_array_elements((c.content)::json)

I tried:

analyze crumbs

But it made no difference.
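A further knob along the same lines would be more detailed statistics for run_id (a speculative sketch, not something tried above; the ~10K distinct values are taken from the description at the top):

alter table crumbs alter column run_id set statistics 1000;
-- pinning n_distinct is another option, in case analyze underestimates it
alter table crumbs alter column run_id set (n_distinct = 10000);
analyze crumbs;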

Update 1: Disabling sequential scans for the whole database works, but this is not an option in our application. In many other places the seq scan should stay:

set enable_seqscan=false;

Plan:

Limit  (cost=0.57..1.14 rows=10 width=32) (actual time=0.120..0.294 rows=10 loops=1)
  Output: x.value
  ->  Nested Loop  (cost=0.57..8580698.45 rows=152343400 width=32) (actual time=0.118..0.273 rows=10 loops=1)
        Output: x.value
        ->  Index Scan using index_crumbs_on_run_id on public.crumbs c  (cost=0.56..5533830.45 rows=1523434 width=895) (actual time=0.087..0.107 rows=10 loops=1)
              Output: c.id, c.robot_id, c.run_id, c.content, c.created_at, c.updated_at, c.table_id, c.fork_id, c.log, c.err
              Index Cond: ((c.run_id)::text = '2016-04-26T19_02_01_015Z'::text)
        ->  Function Scan on pg_catalog.json_array_elements x  (cost=0.01..1.01 rows=100 width=32) (actual time=0.011..0.011 rows=1 loops=10)
              Output: x.value
              Function Call: json_array_elements((c.content)::json)
Planning time: 0.124 ms
Execution time: 0.337 ms
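
For what it's worth, the setting can at least be scoped to a single transaction instead of the whole session (a sketch, not what we run in production):

begin;
set local enable_seqscan = false; -- reverts automatically at commit/rollback
select x from crumbs c,
lateral json_array_elements(c.content::json) x
where c.run_id='2016-04-26T19_02_01_015Z'
limit 10;
commit;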

Update 2:

Schema is:

CREATE TABLE crumbs
(
  id serial NOT NULL,
  run_id character varying(255),
  content text,
  created_at timestamp without time zone,
  updated_at timestamp without time zone,
  CONSTRAINT crumbs_pkey PRIMARY KEY (id)
);

CREATE INDEX index_crumbs_on_run_id
  ON crumbs
  USING btree
  (run_id COLLATE pg_catalog."default");

Update 3:

Rewriting the query like so:

select json_array_elements(c.content::json) x
from crumbs c
where c.run_id='2016-04-26T19_02_01_015Z' 
limit 10

This gets the correct plan. It is still unclear why the wrong plan is chosen for the second query.

  • `((run_id)::text = '2016-04-26T19_02_01_015Z'::text)` run_id looks like a timestamp to me. Why do you store it as a text field? Also: please add the table definition(s), including indexes. – joop May 10 '16 at 12:50
  • Yes, run_id is a timestamp with a text prefix. I omitted the prefix in the question to avoid introducing unrelated complexity. Updating the outputs with explain analyze verbose now. – Tomas Vitulskis May 10 '16 at 12:58
  • Sounds like a situation tailor-made for jsonb – e4c5 May 10 '16 at 13:02
  • @e4c5 or maybe MongoDB? ;-) – asgs May 10 '16 at 13:07
  • Please add the output of `explain (analyze, verbose)` that shows the seq scan, not the plan with seqscan turned off (by the way: `set enable_seqscan=false;` does **not** turn this off for the "whole database", it **only** changes this for the current session) –  May 10 '16 at 13:07
  • explain analyze verbose output added – Tomas Vitulskis May 10 '16 at 13:09
  • No, the second plan in your question is still the output from a plain `explain`, not from `explain (analyze, verbose) select x from crumbs c ...` –  May 10 '16 at 13:09
  • @asgs the benchmarks actually say PostgreSQL 9.5 with JSON outperforms Mongo :)) – e4c5 May 10 '16 at 13:11
  • @a_horse_with_no_name: I cannot post "explain analyze" output for the seq scan query - it runs forever. – Tomas Vitulskis May 10 '16 at 13:47

3 Answers


You've got three different problems going on. First, the limit 10 in the first query tips the planner in favor of the index scan, which would otherwise be pretty expensive for fetching all rows matching that run_id. For the sake of comparison, you might want to see what the first (un-joined) query plan looks like if you remove the limit. My guess is the planner switches to a table scan.
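
For example, that comparison is just the first query without the limit:

explain analyze verbose
select * from crumbs c
where c.run_id='2016-04-26T19_02_01_015Z'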

Second, that lateral join is unnecessary and throws the planner off. You can expand the elements of the content array in your select clause like so:

select json_array_elements(content::json)
from crumbs
where run_id = '2016-04-26T19_02_01_015Z'
;

This is more likely to use the index scan to pick off rows for that run_id, then "unnest" the array elements for you.

But the third, hidden problem is what you're actually trying to get. If you run this last query as-is, then you're in the same boat as the first (un-joined) query without a limit, which means you'll likely not get an index scan (not that that's inherently bad if you're reading such a large chunk of the table).

Do you want just the first few arbitrary array elements from all the content arrays in that run? If so, tacking on a limit clause here should be the end of the story. If you want all array elements for this particular run, you may just have to accept a table scan, although without the lateral join you're potentially in a much better situation than with the original query.
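
Concretely, that is the query above with the limit tacked on:

select json_array_elements(content::json)
from crumbs
where run_id = '2016-04-26T19_02_01_015Z'
limit 10
;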

yieldsfalsehood
  • Removing the limit gets the correct plan in all cases (uses the index, not a full scan). The purpose of the limit is as you guessed - to get arbitrary elements for a preview of the full job. Your rewrite solves the issue. But the mystery remains why PostgreSQL chooses a very expensive plan (scanning 100 GB) when a cheap plan is available. – Tomas Vitulskis May 10 '16 at 14:03

Rewriting the query so that the limit is applied first, and only then the cross join against the function, should make Postgres use the index:

Using a derived table:

select x 
from (
    select *
    from crumbs 
    where run_id='2016-04-26T19_02_01_015Z' 
    limit 10
) c 
  cross join lateral json_array_elements(c.content::json) x

Alternatively using a CTE:

with c as (
  select *
  from crumbs 
  where run_id='2016-04-26T19_02_01_015Z' 
  limit 10
)
select x
from c 
  cross join lateral json_array_elements(c.content::json) x

Or use json_array_elements() directly in the select list:

select json_array_elements(c.content::json) 
from crumbs c
where c.run_id='2016-04-26T19_02_01_015Z' 
limit 10

However, this is something different than the other two queries, because it applies the limit after "unnesting" the JSON array, not to the number of rows returned from the crumbs table (which is what your first query does).
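
If the goal is to limit the number of crumbs rows and still cap the total output, both limits can be combined (a sketch building on the derived-table rewrite above, not a query from the original answer):

select x
from (
    select content
    from crumbs
    where run_id='2016-04-26T19_02_01_015Z'
    limit 10
) c
  cross join lateral json_array_elements(c.content::json) x
limit 10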

  • Do you have any reference on generally discouraging putting SRFs in the select clause? – yieldsfalsehood May 10 '16 at 13:18
  • @yieldsfalsehood: this has been mentioned several times on the mailing lists. One reason for discouraging it - as far as I recall - was that the behaviour when you select other columns together with the function is not clearly defined. But I can't find that right now. –  May 10 '16 at 13:21
  • Thanks for the rewrites. They work, but I remain in magic land - the queries seem identical, yet the plans are different, and therefore we can never be certain the wrong plan won't be selected at some point. – Tomas Vitulskis May 10 '16 at 13:45
  • @TomasVitulskis: you can **never** be sure the "right" execution plan is selected. That is the downside of cost-based optimizers. Adding or removing rows from a table can make the optimizer choose a different plan. Changing the values in a table can make the optimizer choose a different plan - in general it usually gets it right (and changing the plan _is_ a good thing). But no software is perfect. –  May 10 '16 at 13:51
  • @yieldsfalsehood: I found one mailing list thread: http://postgresql.nabble.com/Set-returning-functions-in-select-column-list-tp5491544p5492918.html "*This behavior is widely disliked ... The long-term plan is to implement LATERAL in FROM and then deprecate using SRFs in target lists altogether*" and [this](http://postgresql.nabble.com/Playing-with-set-returning-functions-in-SELECT-list-behaviour-intended-tp1920798p1920799.html) "*The lack of any obviously-sane way to handle multiple SRFs in a targetlist is exactly why the feature is looked on with disfavor*" –  May 10 '16 at 13:59
  • @a_horse_with_no_name thanks for the references. I had also found a patch for 9.4 for putting multiple tables in an unnest as well to deal with those funny situations. Interesting stuff! – yieldsfalsehood May 10 '16 at 14:06

Data modelling suggestions:

        -- Suggest replacing the column run_id (low cardinality, and rather fat)
        -- by a reference to a domain table, like:
        -- ------------------------------------------------------------------
CREATE TABLE runs
        ( run_seq serial NOT NULL PRIMARY KEY
        , run_id character varying UNIQUE
        );

        -- Grab all the distinct values occurring in crumbs.run_id
        -- -------------------------------------------------------
INSERT INTO runs (run_id)
SELECT DISTINCT run_id FROM crumbs;

        -- Add an FK column
        -- -----------------
ALTER TABLE crumbs
        ADD COLUMN run_seq integer REFERENCES runs(run_seq)
        ;

UPDATE crumbs c
SET run_seq = r.run_seq
FROM runs r
WHERE r.run_id = c.run_id
        ;
VACUUM ANALYZE runs;

        -- Drop old column and set new column to not nullable
        -- ---------------------------------------------------
ALTER TABLE crumbs
        DROP COLUMN run_id
        ;
ALTER TABLE crumbs
        ALTER COLUMN run_seq SET NOT NULL
        ;

        -- Recreate the supporting index for the FK
        -- adding id to support index-only lookups
        -- (and enforce uniqueness)
        -- -------------------------------------
CREATE UNIQUE INDEX index_crumbs_run_seq_id ON crumbs (run_seq,id)
        ;

        -- Refresh statistics
        -- ------------------
VACUUM ANALYZE crumbs; -- this may take some time ...

-- and then: join the runs table to your original crumbs table
-- -----------------------------------------------------------
-- explain analyze 
SELECT x FROM crumbs c
JOIN runs r ON r.run_seq = c.run_seq
        , lateral json_array_elements(c.content::json) x
WHERE r.run_id='2016-04-26T19_02_01_015Z'
LIMIT 10
        ;

Or: use the other answerers' suggestions with a similar join.


But possibly even better: replace the ugly run_id text string with an actual timestamp.
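
A rough sketch of that conversion (hypothetical: it assumes the stored values look exactly like the one in the question; the text prefix the asker mentions in the comments would need to be stripped first):

        -- Parse the run_id text into a real timestamp on the runs table
        -- --------------------------------------------------------------
ALTER TABLE runs ADD COLUMN run_ts timestamptz;

UPDATE runs
SET run_ts = to_timestamp(run_id, 'YYYY-MM-DD"T"HH24_MI_SS_MS"Z"');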

joop
  • Thanks! We have a runs table. It's just a preference to have human-readable keys instead of integers, at the expense of extra storage. – Tomas Vitulskis May 10 '16 at 14:11
  • You complain about an `expensive plan`, but you don't want to fix your expensive data modelling faults? – joop May 10 '16 at 14:36