Is there a way to get data from 2 tables without creating a Cartesian product?

Question

In our database a customer can have any number of drivers, any number of vehicles, any number of storage locations, any number of buildings at those locations, any number of comments, and so on. I need a query that returns all of the customer's information and right now the query is something like:

SELECT *
FROM Customer c
INNER JOIN Driver d ON c.ID = d.CustomerID
INNER JOIN Vehicle v ON c.ID = v.CustomerID

The more that a customer has the bigger the result gets, and it grows exponentially because a cartesian product is being created here. 3 drivers, 3 vechiles creates 9 rows, and this is a very small example compared to what our real data is like. We actually have 10 different tables that can hold as many rows per customer as they want. The norm is 2-7 rows at least per table per customer. we have had as many as 60,000,000+ rows returned (6 items each in 10 different tables, 6^10 = 60,466,176) and for our purposes 6 rows total would have given us all the data we needed if we could just stick the 6 rows in each table together.

so in the smaller example, if 1 customer had 2 vehicles and 3 drivers and another customer had 2 vehicles and 1 drivers i would want a result set that looked like:

CustomerID  | DriverID | VehicleID
1           | 1        | 1
1 (or NULL) | 2        | 2
1 (or NULL) | NULL     | 3
2           | 3        | 4
2 (or NULL) | 4        | NULL

Instead our query that joins every table together on CustomerID looks like this:

CustomerID | DriverID | VehicleID
1          | 1        | 1
1          | 1        | 2
1          | 1        | 3
1          | 2        | 1
1          | 2        | 2
1          | 2        | 3
2          | 3        | 4
2          | 4        | 4

Really, what I want to do is just:

SELECT * FROM Driver
SELECT * FROM Vehicle

Because all we are doing with the data is looping through the rows and formatting the information in a document. All drivers are listed, then all vehicles are listed. It makes no sense to do this crazy huge join when we don't have to, but it's just an arbitrary requirement that it must return all the data in 1 result set from a stubborn superior who refuses to listen to reason. Since the columns are different a UNION isn't possible. i'm just hoping there's a way to stick them together horizontally instead of vertically.

Also, I'm using Microsoft SQL Server.

Do you really need the data returned as individual rows? What do you use it for afterwards? You could use an operator which aggregate drivers into a pseudocolumn, vehicles in another, etc etc and group by customer ID, no? — fge, Dec 19 '11 at 23:01
How do you know that Driver one only is associated with vehicle 1? Each driver could be associated with each vehicle, which is what your query is showing. Is there some additional field(s) to describe the other links (i.e. drive 1 gets vehicle 1) — Sparky, Dec 19 '11 at 23:03
Sparky, drivers and vehicles have no association. They are both associated to the customer. I don't care which vehicles are on the same row with which drivers, all I care about is that every vehicle and every driver is in the results. — Nick, Dec 19 '11 at 23:08
fge, what do you mean by pseudocolumn? I am not familiar with that. — Nick, Dec 19 '11 at 23:10
So you want to make sure every related ID is included only once?, not the combinations? A pseudo column is essentially a calculation done as part of a query (non-technical answer), such as quantity*price as Total_order_price — Sparky, Dec 19 '11 at 23:21
Yes, that's is correct. I don't want the combinations. And, oh, then I use pseudocolumns all the time I just didn't know they were called that. How could I use pseudocolumns to solve this problem, though? — Nick, Dec 19 '11 at 23:22
@Nick this is what I was talking about: http://stackoverflow.com/questions/451415/simulating-group-concat-mysql-function-in-ms-sql-server-2005 — fge, Dec 19 '11 at 23:28
That would not work, I still need to be able to access individual columns. — Nick, Dec 19 '11 at 23:31
A Cartesian product is a product of 2 sets. When you have 3 drivers for a customer and 3 vehicles for a customer and your result is 9 rows, 1 row for every possible combination of those 3 drivers and 3 vehicles, that absolutely is the text-book definition of a Cartesian product. My set is not every vehicle in the table, it's every vehicle for a customer. so it's a smaller Cartesian product, but a Cartesian product none-the-less. — Nick, Dec 20 '11 at 00:15
Nick, you are wrong and @a_horse_with_no_name is right. An `INNER JOIN` is **not** a Cartesian product. It's not being "nitpicky about semantics." Let's not bastardize mathematical definitions, please. — NullUserException, Dec 20 '11 at 03:01
What are you talking about? It's a simple definition to find, especially using the Interet: Cartesian prouduct - noun - Mathematics - A set of all pairs of elements (x,y) that can be constructed form given sets, X and Y, such that x belongs to X and y belongs to Y. Set X = Drivers for customer 1. Set Y = Vehicles for customer 1. A set of all ordered pairs of those 2 sets is literally **by definition** a Cartesian product. You say "let's not bastardize mathematical definitions" but that's exactly what you're doing. — Nick, Dec 20 '11 at 07:27
@NullUserException: `INNER JOIN` is not a mathematical definition. In SQL, `INNER JOIN` is a specialized Cartesian product e.g. can one may re-write an `INNER JOIN` using product (`CROSS JOIN`) and restriction (`WHERE`). Consider that Codd's algebra had product and restriction but no `JOIN` operator yet was still relationally complete (a rename operator aside). `CROSS JOIN` is not a Cartesian product because it doesn't produce pairs (I don't know what to call what does produce, though!) And yes, this is mere nitpicking about semantics :) — onedaywhen, Dec 20 '11 at 09:05
Your query is the correct best way to do this. I don't even see why you have a problem with it. — HLGEM, Dec 20 '11 at 18:31
I have a problem with it because it returns more than 60 million rows. and after joining to 10 different tables there are over 500 columns. that's more than 30 billion cells or pieces of data. — Nick, Dec 20 '11 at 19:06
Ok, I figured it out. I can use a pseudocolumn for row number of each item for the same customer and then full outer join the tables on customerID and row number. — Nick, Dec 21 '11 at 16:33

score 3 · Answer 1 · answered Dec 19 '11 at 23:40

3

It's an ugly hack, but you know your proper solution is just as you state:

SELECT * FROM Driver
SELECT * FROM Vehicle

Instead you could use a union query and blank out the columns from the other tables, just start it with a query that sets the type and names of the columns, with a false coldition so it doesn't return a row:

SELECT 1 AS DriverID, "" AS DriverName, 1 AS VehicleID, "" AS VehicleName WHERE 1=0 
UNION SELECT DriverID, DriverName, NULL, NULL FROM Driver
UNION SELECT NULL, NULL, VehicleID, VehicleName FROM Driver

Really, really bad code! Keep working on your superior to allow a better solution.

answered Dec 19 '11 at 23:40

Stephen Turner

7,125
4
51
68

+1 I agree this is an ugly hack but, given the circumstances, it's what I would have proposed as well. :) – Dan J Dec 20 '11 at 01:10
The example I gave is tiny compared to the amount of data we actually have. After joining to all 10 tables we can easily have well over 500 columns. manually naming each column and typing all those nulls would be a huge pain. – Nick Dec 20 '11 at 19:10

score 1 · Accepted Answer · answered Dec 21 '11 at 17:18

Here's how I'm doing it. Instead of:

SELECT *
FROM Customer c
INNER JOIN Driver d ON c.ID = d.CustomerID
INNER JOIN Vehicle v ON c.ID = v.CustomerID

I'm doing:

WITH CustomerCTE AS
(
  SELECT 1 ROW_NUM, ID
  FROM Customer
),
DriverCTE AS
(
  SELECT ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY ID) ROW_NUM, *
  FROM Driver
),
VehicleCTE AS
(
  SELECT ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY ID) ROW_NUM, *
  FROM Vehicle
)
SELECT *
FROM CustomerCTE c
FULL OUTER JOIN DriverCTE d ON c.ID = d.CustomerID AND c.ROW_NUM = d.ROW_NUM
FULL OUTER JOIN VehicleCTE v ON d.CustomerID = v.CustomerID AND d.ROW_NUM = v.ROW_NUM
ORDER BY
CASE WHEN c.ID IS NOT NULL THEN c.ID ELSE
  CASE WHEN d.CustomerID IS NOT NULL THEN d.CustomerID ELSE
    v.CustomerID
  END
END,
CASE WHEN c.ROW_NUM IS NOT NULL THEN c.ROW_NUM ELSE
  CASE WHEN d.ROW_NUM IS NOT NULL THEN d.ROW_NUM ELSE
    v.ROW_NUM
  END
END

Now if a customer has 3 drivers and 3 vehicles i get 3 rows instead of 9 rows. It makes it look like each driver is associated to 1 of the 3 vehicles, but it's actually not. Again, this is bad design, but it is necessary to cut down on the number of rows returned with the unreasonable restrictions I was given.

It looks like more work than webturner's answer, but in my real case where I have to join 10 different tables with over 500 columns its a lot less work to do it this way than to explicitly name all 500 columns and fill in all of the remaining columns from each table with NULL.

Though, this may not be of much use to most people. In most cases if you're doing something like this you probably need to rethink your design, but there may be some cases where you have no choice.

Is there a way to get data from 2 tables without creating a Cartesian product?

2 Answers2