Revised 28 Nov 2010
or am I doing something wrong?
Yes. You are aware of two simple sets that you know, and you are focussed on producing a third set from those two sets.
Taking a Relational approach (The Relational Model; Codd & Date) to the problem means thinking it terms of sets. Both your original sets are projections of data in the database. So rather than combining them, the results, as is, into a third set, you need to think only of a new set, a new projection directly from the data in the database.
Second, after you do that, taking a straight-forward approach to the SQL language to obtain the result, is easy. If the (1) is understood, the programming required is a simple, logic problem. Sure, the ISO/IEC/ANSI SQL Standard permits various code constructs, and there are many ways to obtain the same result; what I am talking about is program logic; simplicity; efficiency; rather than not.
This will execute faster on any ISO/IEC/ANSI SQL-compliant platform.
- It is well-known that Oracle is not compliant, or stretches the definitions of the Standards a great deal, in order to appear to be compliant, so I accept this code may not execute faster on Oracle. Specifically, Oracle people appear to know that scalar subqueries (which logic and simplicity demand here) do not perform well, and they use other constructs instead.
- Any such alternate construct is a departure from (1) and (2). (Eg. Tony Andrews' answer takes OP's question literally, uses the two known sets; joins them; feeds them into a 3rd set; then extracts 3 columns from that.)
Try this, it is the set-oriented method of manipulating Relational databases, producing the new single set directly from the data (1), without regard to the two known sets. It then uses the SQL language in a straight-forward, simple, logical manner (2) to obtain that requirement. Yes, that happens to be two scalar subqueries (returns a single value):
SELECT Producers.name Prod,
(SELECT COUNT(Animals.idanimal)
FROM Animals,
AnimalsBought
WHERE p.nif = Animals.owner
AND Animals.idanimal = AnimalsBought.idanimal
) AnimalsBought,
(SELECT COUNT(Animals.idanimal)
FROM Animals,
AnimalsExploration
WHERE p.nif = Animals.owner
AND Animals.idanimal = AnimalsExploration.idanimal
) AnimalsExploration
FROM Producers p
Comparison 29 Nov 2010
This section is only relevant for people interested the testing Tony and I are involved in, and the conclusions thereof
It bothered me that Tony reported Oracle executes his query in 2 secs vs my query in 5 secs. I could not reconcile myself with the idea that code that is twice as long; that uses 3 times the number of sets, could run faster.
I ran a test on Sybase. I already have a benchmark structure available, which would allow very similar code to be tested (the test on my tables did not need to join two tables but I included that, so that it is as close to OP's question as possible). 2.0M CustomerTransactions (Animals) spread across 100 Customers (Producers). That is, both queries are producing the same result set (100 Customers (Producers) by 3 columns) from 2.0M data points or 100 x 20,000 scalars. Let's identify them accurately, so that we can understand and compare them better. First the DDL for the structures used:
Link to DDL
1_Set_Scalar_Subquery
An exact equivalent of my code above, rendered for the available benchmark structures:
- From OP's description, the structure of the requirement (1) is a list of Customers (Producers).
- That is easily translated (2) into a simple query:
SELECT ... FROM Customer (Producer)
- Next we need two more columns, specific Things (1) re Customer (Producer), which is data that can be derived (not readily available as physical columns) from the database
- This is easily done (2) with two scalar subquery queries
Link to 1_Set_Scalar_Subquery Code & Timings
1083, 1073, 1103, 1073, 1080 ms
Sybase Stats show 3 scans:
Customer (Producer) scanned once
CustomerTransaction (Animal) scanned twice (once for each scalar subquery)
3_Set_Inline_Query
An exact equivalent of Tony's code, rendered for the available benchmark structures:
Link to 3_Set_Inline_Query Code & Timings
1820, 1850, 1846, 1843, 1850 ms
That is 70% slower.
Sybase Stats show 9 scans:
Customer (Producer) scanned twice (outer query)
CustomerTransaction (Animal) scanned twice (the 2 inline sets)
Worktable 2 scanned 3 times (the merged set)
Worktables 1 & 3 scanned once each (for the GROUP BYs
)
Of course the 70% difference will be exaggerated on a busy server.
3_Set_Inline_Query_Improved
Since the original sets provided by OP was inefficient, ie. it could be easily improved, the 2 x GROUP BYs
which require worktables could be removed, etc; to produce something that compares more reasonably, while remaining inline, versus the scalar subquery, I did. Of course, that means, instead of manipulating a large joined set and then GROUP BY
to obtain aggregates; to use (1) and (2) manipulate just the required set, and fill the aggregates in using scalar subqueries.
The purpose of this was to test the veracity of my statement that Sybase performs both scalar subqueries and inline queries with the same efficiency. At least now we are comparing two good code segments each of a completely different structure, instead of one good segment code against one poor code segments.
Link to 3_Set_Inline_Query_Improved Code & Timings
1103, 1073, 1103, 1073, 1080 ms
That is practically identical to 1_Set_Scalar_Subquery
Sybase stats show 4 scans:
Customer (Producer) scanned once
CustomerTransaction (Animal) scanned twice (once for each scalar subquery; even though there are 2 inline sets)
Customer (Producer) scanned once more (for the outermost query)
Therefore it is proved (in this simple test), that Sybase performs the Inline Queries exactly the same as Scalar Subqueries. Given that in this case, it is 3 sets vs 1 set, I would say that the Optimiser has done a great job in Normalising the query (not flattening as Tony inquired, and I responded), in order to produce identical performance results.
The other thing this exercise has proved is, the importance of taking the (1), (2) approach. Assuming even reasonable Normalisation, data in Relational database is best approached with a Relational, set-oriented mindframe. That leads to simpler, more efficient code. Of course, smaller, simpler, normalised code performs better.
Except on Oracle, where due to its peculiarities and limitations, people have to avoid that, and use constructs that are known not to perform badly.
Comparison 01 Dec 2010
Tony posted the results of his tests. They speak for themselves, case closed.
However he introduced the issue issue that although in this particular case (this question) Oracle executes the 1_Set_Scalar_Subquery almost twice as fast as the 3_Set_Inline_Query, because the COUNT() was serviced from the index without accessing the table, that in general Oracle performs Inline queries much better than Scalar Subqueries, and tested SUM() in order to examine the effect of visiting the tables.
I have therefore run the same test in Sybase, with COUNT() changed to SUM(). As with Oracle, the COUNT() was a covered query, but the SUM() now demands visiting the tables.
(Conclusions later, when Tony has posting some missing details. One thing that is bothering me is, Sybase deals with 2.0M rows in under 2 seconds; Oracle even at its best, does not appear to be able to deal with the same work with 1000++ rows in 6 seconds (table populations pending) .)
Tony has posted some missing details, but also opens new issues which I have identified in my comments on his post. Therefore it would still be a bit premature to draw definitive conclusions
So far, in similar tests (not close enough for my liking), on similar systems (ok Sybase system has 2 additional s/w layers and is not a server):
Sybase deals with COUNT 4.0M rows in 1.1 and 1.5 seconds;
Oracle does similar work in 3.5 and 5.7 seconds.
Sybase does the SUM in 2.1 and 3.0 secs;
Oracle did not complete the subquery and did the inline in 79 secs
(further info requests pending).
In the best case (Subquery did not complete, so we can't count that), using poor code as demanded for Oracle, Oracle is 26 times slower than Sybase.
And 40 times slower where the Sybase coder used Relational set theory.
1_Set_Scalar_Subquery_SUM
Same as 1_Set_Scalar_Subquery_SUM, with COUNT() changed to SUM().
Link to 1_Set_Scalar_Subquery_SUM Code & Timings
2173, 2153, 2163, 2153, 2173 ms
Sybase Stats show 3 scans:
Customer (Producer) scanned once
CustomerTransaction (Animal) scanned twice (once for each scalar subquery)
3_Set_Inline_Query_SUM
Same as 3_Set_Inline_Query, with COUNT() changed to SUM().
Link to 3_Set_Inline_Query_SUM Code & Timings
3033, 2993, 3033, 2993, 3013 ms
That is 38% slower.
Sybase Stats show 9 scans:
Customer (Producer) scanned twice (outer query)
CustomerTransaction (Animal) scanned twice (the 2 inline sets)
Worktable 2 scanned 3 times (the merged set)
Worktables 1 & 3 scanned once each (for the GROUP BYs
)
Of course the 38% difference will be exaggerated on a busy server.
3_Set_Inline_Query_Improved_SUM
Same as 3_Set_Inline_Query_Improved, with COUNT() changed to SUM().
Link to 3_Set_Inline_Query_Improved_SUM Code & Timings
2183, 2180, 2183, 2143, 2143 ms
That is practically identical to 1_Set_Scalar_Subquery
Sybase stats show 4 scans:
Customer (Producer) scanned once
CustomerTransaction (Animal) scanned twice (once for each scalar subquery; even though there are 2 inline sets)
Customer (Producer) scanned once more (for the outermost query)
One thing within the Sybase context. My assertion that the Sybase Optimiser Normalises queries, that therefore it handles Scalar Subqueries as well as it handles Inline Views, is further proved. Poor code (3_Set_Inline_Query unimproved) runs slower, but only proportionately slower, no dramatic difference.