3

I am confused about what Redshift is doing when I run 2 seemingly similar queries. Neither should return a result (querying a profile that doesn't exist). Specifically:

SELECT * FROM profile WHERE id = 'id_that_doesnt_exist' and project_id = 1;
  Execution time: 36.75s

versus

SELECT COUNT(*) FROM profile WHERE id = 'id_that_doesnt_exist' and project_id = 1;
  Execution time: 0.2s

Given that the table is sorted by project_id then id I would have thought this is just a key lookup. The SELECT COUNT(*) ... returns 0 results in 0.2sec which is about what I would expect. The SELECT * ... returns 0 results in 37.75sec. That's a huge difference for the same result and I don't understand why?

If it helps schema as follows:

CREATE TABLE profile (
    project_id integer not null,
    id varchar(256) not null,
    created timestamp not null,
    /* ... approx 50 other columns here */
)
DISTKEY(id)
SORTKEY(project_id, id);

Explain from SELECT COUNT(*) ...

XN Aggregate  (cost=435.70..435.70 rows=1 width=0)
  ->  XN Seq Scan on profile  (cost=0.00..435.70 rows=1 width=0)
        Filter: (((id)::text = 'id_that_doesnt_exist'::text) AND (project_id = 1))

Explain from SELECT * ...

XN Seq Scan on profile  (cost=0.00..435.70 rows=1 width=7356)
    Filter: (((id)::text = 'id_that_doesnt_exist'::text) AND (project_id = 1))

Why is the non count much slower? Surely Redshift knows the row doesn't exist?

AndySavage
  • 1,729
  • 1
  • 20
  • 34
  • Even when i cant answer your question. In the case of COUNT(*) you better use COUNT(index_field) that will give you even better performance. – Juan Carlos Oropeza Jul 04 '14 at 17:14
  • @JuanCarlosOropeza: Typically, the opposite is true. – Erwin Brandstetter Jul 04 '14 at 18:19
  • Select * is a bad practice in an analytical DB like Redshift. As you have many columns in your table, you should always focus on the ones that you need. As you can see from the EXPLAIN of your query the width is different. Together with the fact that Redshift is column based DB, it is not designed to do "lookups" fast. – Guy Jul 06 '14 at 08:08
  • @Guy, yeah totally. I see the EXPLAIN is wider too. If it was actually returning a result I would be happy with this explanation as it will have to scan each column looking for the row - obviously slow. Given that the COUNT(*) however knows the row is not there in 0.2s, I still don't understand what it is doing. Selecting one column (not the ID) will return no results in 9s - presumably as it's scanning that column which is slow. I can only think that when scanning all columns it does them in parallel and breadth first rather than depth hence finding "no result" takes longer. – AndySavage Jul 07 '14 at 13:36
  • @AndySavage you are certainly right that it can be done better. However, there is a secret agreement between users and developers, that the developer will do his best to support the best effort of the user. If the user can optimize it easily, the developer can focus on the hard problems. – Guy Jul 08 '14 at 08:46
  • From my personal experience as well, Count queries are much faster than usual selects in Redshift due to block level max min values stored in Redshift storage block header. – androboy Sep 18 '14 at 06:27

1 Answers1

1

The reason is: in many RDBMS's the answer on count(*) question usually come without actual data scan: just from index or table statistics. Redshift stores minimal and maximal value for a block that used to give exist or not exists answers for example like in describer case. In case requested value inside of min/max block boundaries the scan will be performed only on filtering fields data. In case requested value is lower or upper block boundaries the answer will be given much faster on basis of the stored statistics. In case of "select * " question RedShift actually scans all columns data as asked in query: "*" but filter only by columns in "where " clause.

Yuri Levinsky
  • 1,515
  • 2
  • 13
  • 26