I have a table with hundreds of millions of rows, where I want to get a single list of the unique values from 2 indexed columns from the same table (with no unique row ID).
To illustrate this, let's say we have a table with a fruits
column and a veggies
column, and I want to build a healthy_foods
list with the unique values in both columns.
I have tried the following queries:
with UNION
WITH cte as (
SELECT fruit, veggie
FROM recipes
)
SELECT fruit as healthy_food
FROM cte
UNION -- <---
SELECT veggie as healthy_food
FROM cte;
with UNION ALL then DISTINCT ON
WITH cte as (...)
SELECT DISTINCT ON (healthy_food) healthy_food FROM -- <---
(SELECT fruit as healthy_food
FROM cte
UNION ALL -- <---
SELECT veggie as healthy_food
FROM cte) tb;
with UNION ALL then GROUP BY
WITH cte as (...)
SELECT fruit as healthy_food
FROM cte
UNION ALL -- <---
SELECT veggie as healthy_food
FROM cte
GROUP BY healthy_food; -- <---
(and adding HAVING COUNT(*) = 1
and GROUP BY
on each SELECT from the UNION)
The UNION ALL gets executed super fast, but all the duplicate-removing combinations I've tried take +15mins.
How could I optimize this query, taking into account that the 2 fields/columns come from the same table and are indexed?
(or alternatively, what would be the least expensive way to keep track of all unique values? maybe a trigger inserting on a UNIQUE table, or a view?)