1

I have two tables in redshift:

  • tbl_current_day - about 4.5M rows
  • tbl_previous_day - about 4.5M rows, with the same data exactly as tbl_current_day

In addition to it, I have a view called qry_both_days defined as following:

CREATE OR REPLACE qry_both_days AS 
SELECT * FROM tbl_current_day
UNION SELECT * FROM tbl_previous_day;

When I run a query on one of the separate tables, I get very good performance as expected. For example, the following query runs 5 seconds:

select count(distinct person_id) from tbl_current_day;
-- (person_id is of type int)

Explain plan:

 XN Aggregate  (cost=1224379.82..1224379.82 rows=1 width=4)
   ->  XN Subquery Scan volt_dt_0  (cost=1224373.80..1224378.61 rows=481 width=4)
         ->  XN HashAggregate  (cost=1224373.80..1224373.80 rows=481 width=4)
               ->  XN Seq Scan on tbl_current_day  (cost=0.00..979499.04 rows=97949904 width=4)

Note that width is 4 bytes, as it's supposed to be, as my column is of type int.

HOWEVER, when I run the same query on qry_both_days the query runs 20 times slower, while I would expect it to run only 2 times slower, as it should go over twice more rows:

select count(distinct person_id) from qry_both_days;

Explain plan:

 XN Aggregate  (cost=55648338.34..55648338.34 rows=1 width=4)
   ->  XN Subquery Scan volt_dt_0  (cost=55648335.84..55648337.84 rows=200 width=4)
         ->  XN HashAggregate  (cost=55648335.84..55648335.84 rows=200 width=4)
               ->  XN Subquery Scan qry_both_days  (cost=0.00..54354188.49 rows=517658938 width=4)
                     ->  XN Unique  (cost=0.00..49177599.11 rows=517658938 width=190)
                           ->  XN Append  (cost=0.00..10353178.76 rows=517658938 width=190)
                                 ->  XN Subquery Scan "*SELECT* 1"  (cost=0.00..89649.20 rows=4482460 width=190)
                                       ->  XN Seq Scan on tbl_current_day  (cost=0.00..44824.60 rows=4482460 width=190)
                                 ->  XN Subquery Scan "*SELECT* 2"  (cost=0.00..90675.00 rows=4533750 width=187)
                                       ->  XN Seq Scan on tbl_previous_day  (cost=0.00..45337.50 rows=4533750 width=187)

The problem: width is now 190, not 4 bytes as it's supposed to be!!! Anybody knows how to make RedShift pick only the relevant columns on UNION SELECT?

Thanks!

diemacht
  • 2,022
  • 7
  • 30
  • 44
  • Since the data in the previous and current day tables should be disjoint, you should be able to get an additional performance increase by using UNION ALL, which will skip the DISTINCT step of the UNION operation. – Sim Dec 31 '13 at 17:51

2 Answers2

5

UNION used by itself removes duplicate rows, e.g., uses an implied DISTINCT, as per the SQL spec.

That means that a lot more processing is required to prepare the output.

If you do not want DISTINCT results then you should always use UNION ALL to make sure the database is not checking for potential dupes.

Joe Harris
  • 13,671
  • 4
  • 47
  • 54
  • 3
    This actually empirically seems to be the real issue, at least as of Q3 2015. I'm creating a view over the union of two full, same-width tables, one containing ~1B rows, and one containing ~8B rows...the difference between a view using `SELECT * [..] UNION SELECT [..]` and `SELECT * [..] UNION ALL SELECT * [..]` is quite stark. `UNION ALL` is about 5% more expensive than a single-table `SELECT`, while `UNION` is about 150x as expensive. Here's a gist with the annotated `EXPLAIN` output - https://gist.github.com/slpsys/5e43d8237fd8aa924015 – Marc Bollinger Sep 15 '15 at 21:16
1

Your view is created as SELECT *, so it always queries all the columns to create data for the view. Then another SELECT is used and only requested columns from the view are returned.

If you have limited number of selected columns (like a two, three sets that are used all the time), I'd create a separate view for each column set.

Another way (even less elegant than one before) is to call each view so its name says which columns are included (lets say sorted and separated with '__') - like qry_both_days__age__name__person_id. Then, before each query, check if required view exists, if not create it.

Tomasz Tybulewicz
  • 8,487
  • 3
  • 42
  • 44
  • Thanks for the advice. However, I can't tell ahead what would be the columns the user will query by. Is there any other way to write a UNION SELECT in RedShift such that it will fetch only the relevant columns only? – diemacht Nov 20 '13 at 12:22
  • 1
    @diemacht, take a look at my addon to the answer :) – Tomasz Tybulewicz Nov 20 '13 at 15:25
  • This type of column selection optimization is the reason why I sometimes prefer code-generating queries from a higher-level representation, e.g., a domain-specific language (DSL). When you take CTEs and multiple levels of aggregation/summarization into account, managing many custom column lists through several layers of views (or subqueries) becomes slow & error-prone. – Sim Dec 31 '13 at 17:55