
Each table contains more than 10 thousand records. Which query returns results faster when listing this data: is a nested query faster than a join query?

Nested query:

$query = $baglanti->prepare("Select * from table_1");
$query->execute();
if($query->rowCount() > 0){
    while($query_data=$query->fetch(PDO::FETCH_ASSOC)){
        $query_2 = $baglanti->prepare("Select * from table_2 where table_1_id = :id");
        $query_2->bindParam(":id", $query_data["id"], PDO::PARAM_INT);
        $query_2->execute();
        if($query_2->rowCount() > 0){
            while($query_2_data=$query_2->fetch(PDO::FETCH_ASSOC)){
                $query_3 = $baglanti->prepare("Select * from table_3 where table_2_id = :id");
                $query_3->bindParam(":id", $query_2_data["id"], PDO::PARAM_INT);
                $query_3->execute();
                if($query_3->rowCount() > 0){
                    while($query_3_data=$query_3->fetch(PDO::FETCH_ASSOC)){
                        // ... the same pattern continues for table_4, table_5, and so on
                    }
                }
            }
        }
    }
}

Inner join query:

$query = $baglanti->prepare("Select * from table_1
    INNER join table_2
    on table_1.id=table_2.table_1_id
    INNER join table_3
    on table_2.id=table_3.table_2_id
    INNER join table_4
    on table_3.id=table_4.table_3_id
    INNER join table_5
    on table_4.id=table_5.table_4_id
    ...
    ");
$query->execute();
if($query->rowCount() > 0){
    while($query_data=$query->fetch(PDO::FETCH_ASSOC)){

    }
}

Which one returns results faster? (I've indexed all the tables.)

Mehravish Temkar
  • Inner joins should work faster. Nested queries almost always run slower. – Brad May 03 '18 at 18:22
  • I can't imagine a scenario where nested queries would run faster. Each one has to open a session, send the SQL in, parse the SQL, optimize/determine the execution path, run the execution path, and push data back over the pipe to the application. With a JOIN it's one trip to the database; the database then gets it all at once, joins it (which it's specifically programmed to do CRAZY efficiently), and then returns one result set. – JNevill May 03 '18 at 18:23
  • A `JOIN` will let the database optimizer do a better job of building out how it retrieves the data and will let it filter data that it doesn't need to bother with. – Shawn May 03 '18 at 18:33
  • @JNevill if the "parent data" included huge chunks of data that would be multiplied by large numbers of small "child data", nested could be better... but even then there are better non-nested options. – Uueerdo May 03 '18 at 18:46

1 Answer


The answer, as with most things database, is that it depends.


First, let's note that the results from the two query patterns differ in a significant way.

As an example, consider what happens when table_3 is empty (contains no rows). With the JOIN query pattern, we won't get anything back... the resultset will contain zero rows.

The other query, running separate queries of each table, will return rows from table_1 and table_2.

Also, with the JOIN pattern, we are going to be returning redundant copies of data from table_1, table_2...


But in terms of "faster", the JOIN pattern will typically be faster, because it eliminates a LOT of the round trips to the database, and with them the per-statement overhead: sending the SQL, parsing the tokens, checking semantics, developing an execution plan, executing the plan, preparing a resultset, returning the resultset to the client, waiting for the client to do a fetch, and then cleaning up (closing the statement handle and discarding the resultset).

When the number of database round trips increases dramatically, that small per-statement overhead starts to accumulate into something significant.
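If you want to quantify that overhead on your own data, a rough timing sketch along these lines is usually enough to show the trend. It only compares table_1 and table_2 under each pattern, assumes the $baglanti connection and schema from the question, and is illustrative rather than a rigorous benchmark:

$start = microtime(true);

// Pattern A: one statement execution per table_1 row (many round trips).
$stmt = $baglanti->prepare("SELECT * FROM table_2 WHERE table_1_id = :id");
foreach ($baglanti->query("SELECT id FROM table_1") as $parent) {
    $stmt->execute([":id" => $parent["id"]]);
    $stmt->fetchAll(PDO::FETCH_ASSOC);
}
echo "nested: ", microtime(true) - $start, " seconds\n";

$start = microtime(true);

// Pattern B: a single JOIN (one round trip, one larger resultset).
$join = $baglanti->query(
    "SELECT * FROM table_1
     INNER JOIN table_2 ON table_1.id = table_2.table_1_id"
);
$join->fetchAll(PDO::FETCH_ASSOC);
echo "join:   ", microtime(true) - $start, " seconds\n";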

The upside of the simple single-table queries is that there tend to be fewer execution paths to consider, and we typically get a reasonably efficient plan for each query (assuming suitable indexes are available).


The risk with the JOIN pattern is that we can generate a very large resultset that includes a lot of redundant data on each row.


Let's consider a scenario:

If we've got 1,000 rows in table_1.

And if we've got 1,000 rows in table_2 for each row in table_1.

And if we've got 100 rows in table_3 for each row in table_2.

And if we've got 10 rows in table_4 for each row in table_3.

And if we've got 1 row in table_5 for each row in table_4.

Some quick math here... 10^3 * 10^3 * 10^2 * 10^1 * 10^0 = 10^9

That will result in one billion rows in the resultset. The data for each row in table_1 is going to be repeated 10^6 times. That's one million copies of the same table_1 values.

We have the potential for a "very large" resultset, and a corresponding increase in resource demands, which can incur a performance penalty.

So we tend towards a middle ground. We prefer to process sets rather than processing RBAR (row by agonizing row), but we also want to avoid Hugh Jass resultsets.

Optimal performance might be obtained somewhere between those two approaches: for example, by processing individual rows from table_1 in a loop, and, for each row retrieved, running a single query that JOINs the remaining four tables to get back a combined result.
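As a minimal sketch of that middle-ground approach (assuming the $baglanti connection and the hypothetical table_1 ... table_5 schema from the question; the column lists and key names are illustrative, not a definitive implementation):

// One query for the parent rows...
$parents = $baglanti->query("SELECT id FROM table_1");

// ...and one prepared JOIN statement, reused for every parent row, that
// pulls the remaining four tables back as a single combined resultset.
// (In practice, alias the columns you actually need instead of selecting
// everything, to avoid column-name collisions with FETCH_ASSOC.)
$children = $baglanti->prepare(
    "SELECT table_2.*, table_3.*, table_4.*, table_5.*
       FROM table_2
      INNER JOIN table_3 ON table_2.id = table_3.table_2_id
      INNER JOIN table_4 ON table_3.id = table_4.table_3_id
      INNER JOIN table_5 ON table_4.id = table_5.table_4_id
      WHERE table_2.table_1_id = :id"
);

while ($parent = $parents->fetch(PDO::FETCH_ASSOC)) {
    $children->bindValue(":id", $parent["id"], PDO::PARAM_INT);
    $children->execute();
    while ($row = $children->fetch(PDO::FETCH_ASSOC)) {
        // process one combined table_2..table_5 row for this table_1 row
    }
}

This keeps the number of round trips proportional to the table_1 row count while avoiding the full multiplicative blow-up of a single five-table JOIN.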

spencer7593