1

What's the most efficient way to run a query encapsulated in another?

1. Querying inside a query:

select
  id_user
from table1
where
  foo in (select foo from table2 where dt_signin > '2014-01-01 00:00')

2. By using a temporary table:

create table tmp_table as
select
  foo
from table2
where
  dt_signin > '2014-01-01 00:00'

and then querying the temporary table

select
  id_user
from table1 t1
join tmp_table tmp
 on t1.foo = tmp.foo

Using method 1 is the query inside the where clause run for (# rows of table1) times or just once and stored in memory for further comparisons with foo?

Thiago
  • 652
  • 1
  • 9
  • 29
  • 2
    There's a `where exists` clause that might help you. Also see http://stackoverflow.com/questions/7471625/fastest-check-if-row-exists-in-postgresql – zedfoxus Aug 26 '14 at 17:51
  • A solution with two queries is usually slower than a single query - especially if the first query reads **and** writes data . –  Aug 26 '14 at 18:09

3 Answers3

1

The exists variant frequently outperforms the others

select id_user
from table1
where exists (
    select 1
    from table2
    where
        dt_signin > '2014-01-01 00:00'
        and
        table1.foo = foo
)

Notice that the outer table1.foo is compared to foo in the subquery

Clodoaldo Neto
  • 118,695
  • 26
  • 233
  • 260
1

Your two queries do not do the same thing. For the second approach to be equivalent, you need select distinct:

select distinct id_user
from table1 t1 join
     tmp_table tmp
     on t1.foo = tmp.foo;

Because of this additional operation, I might expect the in to perform better. But, in general, when you have a specific performance question, you should test it on your data on your system.

As for your queries in the question, there are many different ways to approach this. Here are a few:

  • Subquery with in
  • Two tables with a join and distinct
  • Two tables with in
  • Two tables with exists
  • Subquery with exists
  • Two tables with exists
  • CTE with a join
  • CTE with in
  • CTE with exists

In an ideal world, SQL compilers would simply study the query and get the best execution plan, regardless of how the query is expressed. That world is not the world we live in, alas.

One reason a temporary table is sometimes useful (and I much prefer single query solutions) is for optimization purposes:

  1. The statistics about the temporary table are known, so the optimizer can choose a better plan.
  2. You can build indexes on temporary tables to improve performance.

Your subquery is not very complicated, so these probably are not issues.

Under different circumstances, different approaches might work better. As a default, I would build an index on tmp_Table(dt_signin, foo) and use exists.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Would a subquery with **in** require that subquery to be executed for each row of the first table? (refer to the method 1 in the question) – Thiago Aug 28 '14 at 16:05
  • @Thiago . . . I do not know if Postgres always executes such queries sequentially. Some databases do. Others have optimizations for `in` with a subquery. The real answer for a particular situation is to use `explain`. – Gordon Linoff Aug 28 '14 at 20:13
0

What about :

select
  id_user
from 
  table1 t1
  join (
    select foo from table2 where dt_signin > '2014-01-01 00:00'
  )  t2 ONt1.foo = t2.foo

There is no need to create a temporary table if you can use a join.

Anthony Raymond
  • 7,434
  • 6
  • 42
  • 59