6

I have a table with hundreds of millions of rows, where I want to get a single list of the unique values from 2 indexed columns from the same table (with no unique row ID).

To illustrate this, let's say we have a table with a fruits column and a veggies column, and I want to build a healthy_foods list with the unique values in both columns.

I have tried the following queries:

with UNION

WITH cte as (
    SELECT fruit, veggie
    FROM recipes
)
SELECT fruit as healthy_food
         FROM cte
         UNION --  <--- 
         SELECT veggie as healthy_food
         FROM cte;

with UNION ALL then DISTINCT ON

WITH cte as (...)
SELECT DISTINCT ON (healthy_food) healthy_food FROM  --  <--- 
(SELECT fruit as healthy_food
     FROM cte
     UNION ALL --  <--- 
     SELECT veggie as healthy_food
     FROM cte) tb;

with UNION ALL then GROUP BY

WITH cte as (...)
SELECT fruit as healthy_food
         FROM cte
         UNION ALL --  <--- 
         SELECT veggie as healthy_food
         FROM cte
GROUP BY healthy_food; --  <--- 

(and adding HAVING COUNT(*) = 1 and GROUP BY on each SELECT from the UNION)

The UNION ALL gets executed super fast, but all the duplicate-removing combinations I've tried take +15mins.

How could I optimize this query, taking into account that the 2 fields/columns come from the same table and are indexed?

(or alternatively, what would be the least expensive way to keep track of all unique values? maybe a trigger inserting on a UNIQUE table, or a view?)

Z. M.
  • 329
  • 5
  • 13
  • What's wrong with `select fruit from recipes union select veggies from recipes`? If both columns are indexed separately, the engine should scan the indexes without hitting the heap even once. – The Impaler Apr 26 '21 at 13:36
  • 1
    If you are showing a list of hundreds of millions of rows to the user, that's bound to take some time. But... does it make sense in the first place? – The Impaler Apr 26 '21 at 13:39
  • @TheImpaler just `UNION` takes 15 minutes, while the ones with `GROUP BY` clauses take <6min. Maybe I should change the index type? It's a text field. (Not showing the entire list anywhere, I just want to do some processing on each unique value - only once) – Z. M. Apr 26 '21 at 14:28
  • Roughly how many duplicates among fruits, among veggies, and *between* fruits and veggies? Actual (relevant parts of) table definition, and always your version of Postgres would be instrumental, too. – Erwin Brandstetter Apr 26 '21 at 15:16

3 Answers3

4

If there are many duplicates among fruits and / or veggies, but not so many between fruits and veggies (like the names in your example suggest), and since you have an index for both of them, emulating an index skip scan (a.k.a. loose index scan) will work wonders:

WITH RECURSIVE fruit AS (
   (
   SELECT fruit
   FROM   recipes
   ORDER  BY 1
   LIMIT  1
   )
   UNION ALL
   SELECT (SELECT fruit
           FROM   recipes
           WHERE  fruit > t.fruit
           ORDER  BY 1
           LIMIT  1)
   FROM   fruit t
   WHERE  t.fruit IS NOT NULL
   )
 , veggie AS (
   (
   SELECT veggie
   FROM   recipes
   ORDER  BY 1
   LIMIT  1
   )
   UNION ALL
   SELECT (SELECT veggie
           FROM   recipes
           WHERE  veggie > t.veggie
           ORDER  BY 1
           LIMIT  1)
   FROM   veggie t
   WHERE  t.veggie IS NOT NULL
   )
SELECT DISTINCT healthy_food
FROM  (
   SELECT fruit AS healthy_food FROM fruit
   UNION  ALL
   SELECT veggie AS healthy_food FROM veggie
   ) sub
WHERE  healthy_food IS NOT NULL;

Just DISTINCT instead of DISTINCT ON (like you tried) in the outer SELECT, since we are dealing with a single column.

See:

You might as well use UNION instead of UNION ALL + DISTINCT in the outer SELECT. Only avoided that because you explicitly asked for it. But I don't see the point.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • 3
    This one has blown my mind! `UNION` by itself takes 16 minutes, the `GROUP BY` variants take <6mins, but this one did it in... **33 SECONDS!** Hats off to you (and I will further investigate this topic, as your reply has incited great curiosity in me) – Z. M. Apr 26 '21 at 16:23
  • If there number of distinct values is [very] small compared to the total number of rows, this strategy will indeed be blazing fast. +1 – The Impaler Apr 26 '21 at 16:50
1

I would suggest:

select distinct r.fruit
from recipe r
union all
select distinct r.veggie
from recipe r
where not exists (select 1 from recipe r2 where r2.fruit = r.veggie);

Then for this query, you want indexes on:

  • recipe(fruit)
  • recipe(veggie)

If your statistics are up-to-date and you are using a recent version of Postgres, then the select distinct should use the index. And then the not exists should use it as well. The query should not have any additional duplicate elimination.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
0

If you have separate indexes on fruit and veggie this query should be reasonably quick

SELECT fruit AS healthy FROM recipes
 UNION 
SELECT veggie AS healthy FROM recipes;

But you'll need separate indexes on those two columns so each side of the UNION can have a useful index.

You could use a materialized view for this application.

CREATE OR REPLACE MATERIALIZED VIEW healthy_food AS 
SELECT fruit AS healthy FROM recipes
 UNION 
SELECT veggie AS healthy FROM recipes;
CREATE INDEX healthy ON healthy_food (healthy);
...

SELECT healthy FROM healthy_food ORDER BY healthy;

Selecting from the view will be fast. But you'll have to routinely update the view like this. The refresh operation will incur the cost of the base query.

REFRESH MATERIALIZED VIEW healthy_food;

If your application can tolerate it being out of date between refreshes, this is a good way to proceed. If the base query is still too expensive you can do the refresh once a day at zero-dark-thirty.

O. Jones
  • 103,626
  • 17
  • 118
  • 172