
Recently I have been getting familiar with Clojure, and I am fascinated by the idea of lazy sequence evaluation, which computes values only when they are needed.

I work a lot with a PostgreSQL DB and I have noticed different query performance when a LIMIT clause is used. For example, the query

SELECT * FROM (
    SELECT id FROM foo1
    INTERSECT
    SELECT id FROM foo2) AS subquery
LIMIT 50 

will have the same execution time as

SELECT id FROM foo1
INTERSECT
SELECT id FROM foo2

This suggests that Postgres first evaluates the whole result and then just takes the first 50 rows. This behaviour is the opposite of laziness, because the DB processes data that is not required to produce the final answer. But on the other hand, the query

SELECT * FROM foo1 INNER JOIN foo2 ON foo1.id=foo2.id LIMIT 50

performs much better than

SELECT * FROM foo1 INNER JOIN foo2 ON foo1.id=foo2.id
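
One way to see what actually happens is to compare the query plans. A minimal sketch (using the foo1 and foo2 tables from the examples above):

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM (
    SELECT id FROM foo1
    INTERSECT
    SELECT id FROM foo2) AS subquery
LIMIT 50;

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM foo1 INNER JOIN foo2 ON foo1.id=foo2.id
LIMIT 50;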

Does somebody know which Postgres operations support this kind of LIMIT laziness?

- Note: `LIMIT` without `ORDER BY` makes little or no sense. – wildplasser Dec 27 '14 at 19:23
- For questions concerning query plans or performance please ***always*** include your version of Postgres and exact table definitions of your case. Read the tag info for [`[postgresql-performance]`](http://stackoverflow.com/tags/postgresql-performance/info). Indexes and constraints are particularly relevant. – Erwin Brandstetter Dec 28 '14 at 00:13
- @wildplasser: As long as an *arbitrary* selection is fine, `LIMIT` can make sense without `ORDER BY`. Many newcomers are not aware of the details, though, so we see mostly incorrect queries like that in questions here on SO. – Erwin Brandstetter Dec 28 '14 at 01:25

1 Answer


For starters, your queries are not equivalent unless id is defined unique in both tables. INTERSECT handles duplicates differently from INNER or OUTER JOIN. The missing key word ALL makes the difference even bigger. Per documentation:

> The result of INTERSECT does not contain any duplicate rows unless the ALL option is specified. With ALL, a row that has m duplicates in the left table and n duplicates in the right table will appear min(m,n) times in the result set.

Joins on the other hand produce a Cartesian product, that is m*n rows for matching duplicates. So the query plans cannot use the same code path.
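
A tiny self-contained illustration (the values are made up just for this sketch): with id = 1 appearing twice on the left and three times on the right, INTERSECT ALL yields min(2,3) = 2 rows while a join yields 2 * 3 = 6 rows:

WITH l(id) AS (VALUES (1), (1))
   , r(id) AS (VALUES (1), (1), (1))
SELECT id FROM l
INTERSECT ALL
SELECT id FROM r;                     -- 2 rows: min(m,n)

WITH l(id) AS (VALUES (1), (1))
   , r(id) AS (VALUES (1), (1), (1))
SELECT id FROM l JOIN r USING (id);   -- 6 rows: m * n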

To get results that differ less (but still not equivalent except for unique id), use instead:

SELECT id FROM foo1
INTERSECT ALL  -- don't fold dupes
SELECT id FROM foo2
LIMIT 50;

SELECT * FROM foo1 JOIN foo2 USING (id) LIMIT 50; -- return single id column

Here is a fiddle to play with (pg 9.3.1). The version on sqlfiddle.com is pretty outdated by now. Better to test with the latest version yourself.

A lot more time and smarts went into optimization of joins which are more commonly used by several orders of magnitude. I hardly ever use INTERSECT, because it often yields inferior query plans. In a quick test on pg 9.3 I could only get sequential scans out of INTERSECT where joins use much faster index scans. I wouldn't know of any news concerning INTERSECT in pg 9.4.
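
A rough sketch of that kind of test (table definitions and row counts are made up for illustration, not taken from the fiddle):

CREATE TABLE foo1 (id int PRIMARY KEY);
CREATE TABLE foo2 (id int PRIMARY KEY);
INSERT INTO foo1 SELECT generate_series(1, 100000);
INSERT INTO foo2 SELECT generate_series(50001, 150000);
ANALYZE foo1;
ANALYZE foo2;

EXPLAIN SELECT id FROM foo1 INTERSECT ALL SELECT id FROM foo2 LIMIT 50;
EXPLAIN SELECT * FROM foo1 JOIN foo2 USING (id) LIMIT 50;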

There would be potential for improvement, especially where unique indexes are involved. I guess nobody cared enough to work on it much since INTERSECT is not as popular as other operations.

I do know of very nice use cases for UNION ALL in combination with LIMIT to benefit from "lazy evaluation", though:
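
The typical shape looks like this (table names are made up for the sketch): in a plain, non-parallel plan the branches of the Append node are evaluated top to bottom, and execution stops as soon as the LIMIT is satisfied, so the expensive second branch may never be scanned at all.

SELECT id, name FROM current_items  WHERE name LIKE 'foo%'   -- cheap branch, tried first
UNION ALL
SELECT id, name FROM archived_items WHERE name LIKE 'foo%'   -- expensive branch, may never run
LIMIT 10;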
