
SQL Fiddle.

I'm having a slow start to the morning. I thought there was a more efficient way to make the following query using a join, instead of two independent selects -- am I wrong?

Keep in mind that I've simplified/reduced my query into this example for SO purposes, so let me know if you have any questions.

SELECT DISTINCT c.* 
FROM   customers c
WHERE  c.customer_id IN (select customer_id from customers_cars where car_make = 'BMW')
  AND  c.customer_id IN (select customer_id from customers_cars where car_make = 'Ford')
;

Sample Table Schemas

-- Simple tables to demonstrate point
CREATE TABLE customers (
  customer_id serial,
  name text
  );

CREATE TABLE customers_cars (
  customer_id integer,
  car_make text
  );


-- Populate tables
INSERT INTO customers(name) VALUES
  ('Joe Dirt'),
  ('Penny Price'),
  ('Wooten Nagen'),
  ('Captain Planet')
;

INSERT INTO customers_cars(customer_id,car_make) VALUES
  (1,'BMW'),
  (1,'Merc'),
  (1,'Ford'),
  (2,'BMW'),
  (2,'BMW'),      -- Notice car_make is not unique
  (2,'Ferrari'),
  (2,'Porche'),
  (3,'BMW'),
  (3,'Ford');
-- ids 1 and 3 both have BMW and Ford

Other Expectations

  • There are ~20 distinct car_make values in the database
  • There are typically 1-3 car_make per customer_id
  • No more than 50 car_make assignments per customer_id are expected (generally 20-30)
  • The query is generally only going to look for 2-3 specific car_make per customer (e.g., BMW and Ford), but not 10-20
vol7ron
  • Crucial: Is the combination `(customer_id, car_make)` unique? – Erwin Brandstetter Nov 05 '14 at 15:40
  • That is crucial and it is not unique -- *bastos.sergios* made me recognize this, I'll update the question. There can be multiple car_make per customer_id. – vol7ron Nov 05 '14 at 15:44
  • I updated your fiddle accordingly. – Erwin Brandstetter Nov 05 '14 at 15:49
  • Thanks Erwin, I was testing out other answers and only updated it there for testing. Feel free to update the question. – vol7ron Nov 05 '14 at 15:50
  • Are we only ever interested in two cars at a time? Or very few? Or can this grow to a long list? – Erwin Brandstetter Nov 05 '14 at 15:51
  • I would say there are about 20 *cars* and a customer tends to take on 1-3. But to get at what I think you're asking, I think there will be 50 or less assignments per customer_id and do not intend for it to grow in the future. --- *all this talk about cars is making me want to actually work on a database about cars* – vol7ron Nov 05 '14 at 15:57
  • I added one last bullet point: for the most part the filter will only need to look up *2* values, and not 20. I can't see it looking up any more than 5 specific *car_make*s. – vol7ron Nov 05 '14 at 16:04

4 Answers


And here is another option; I don't know which one would be fastest on large tables.

SELECT  customers.*
FROM    customers
    JOIN customers_cars USING(customer_id)
WHERE   car_make = ANY(ARRAY['BMW','Ford'])
GROUP BY
    customer_id, name
HAVING  array_agg(car_make) @> ARRAY['BMW','Ford'];

vol7ron: Fiddle

The following is a modification of the above, using the same idea of an array for comparison. I'm not sure how much more efficient it would be than the dual-subquery approach, since it has to build an array in one pass and then do a heavier-handed comparison of the array's elements.

SELECT DISTINCT c.* 
FROM   customers c
WHERE  customer_id IN (
  select   customer_id
  from     customers_cars 
  group by customer_id
  having   array_agg(car_make) @> ARRAY['BMW','Ford']
);
Frank Heikens
  • So you suggest sticking the values in an array and comparing on that? – vol7ron Nov 05 '14 at 15:36
  • Yes, because you can use the same input twice, and it makes the use of prepared statements (to avoid SQL injection) in your application easier. And if you need a third or fourth parameter, just add it to the array and you're done: no rewriting of the query needed, unlike the other examples (see the sketch after these comments). – Frank Heikens Nov 05 '14 at 15:40
  • Feel free to take it out or modify it. I think they're doing almost the same thing. – vol7ron Nov 05 '14 at 15:55
  • Gave this +1 because it's much easier to read, though I selected the other answer because it's about 100% more efficient (even using the small dataset that I provided). – vol7ron Nov 05 '14 at 17:50
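To illustrate Frank's point, here is a minimal sketch (the statement name is illustrative) that binds the list of makes once as an array parameter, so the same value feeds both the filter and the containment check, and growing the list does not require rewriting the SQL:

-- Sketch only: the array of makes is bound once as $1 and reused in both places.
PREPARE customers_with_all_makes(text[]) AS
SELECT  customers.*
FROM    customers
    JOIN customers_cars USING(customer_id)
WHERE   car_make = ANY($1)
GROUP BY
    customer_id, name
HAVING  array_agg(car_make) @> $1;

EXECUTE customers_with_all_makes(ARRAY['BMW','Ford']);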

I would write it as

SELECT DISTINCT c.customer_id 
FROM   customers c
JOIN   customers_cars cc_f on c.customer_id = cc_f.customer_id and cc_f.car_make = 'Ford'
JOIN   customers_cars cc_b on c.customer_id = cc_b.customer_id and cc_b.car_make = 'BMW'
;

Whether this is better or not I don't know. In some RDBMSs plain joins like this work better than subqueries, but I don't know about Postgres. From a readability point of view it is also questionable.

MK.
  • An important piece of information I did not say was that car_make is not unique. – vol7ron Nov 05 '14 at 15:47
  • I gave it +1 -- Postgres tends to be pretty efficient. If it's not more efficient doing it this way, that's because its query planner has done something to translate the other form into this, making them equivalent. Regardless, this is what I was first after. – vol7ron Nov 05 '14 at 17:55

It seems to me that you are trying to find customers that have at least one BMW and at least one Ford. This query should get that for you:

SELECT
     customers.customer_id
FROM
    customers
        INNER JOIN customers_cars ON
            customers.customer_id = customers_cars.customer_id
            AND customers_cars.car_make IN ('BMW', 'Ford')
GROUP BY
    customers.customer_id
HAVING
    -- count only the rows matching each make; both counts must be non-zero
    COUNT(CASE WHEN car_make = 'BMW' THEN 1 ELSE NULL END) > 0
    AND COUNT(CASE WHEN car_make = 'Ford' THEN 1 ELSE NULL END) > 0;

Make sure you have indexes on customers_cars.customer_id and customers_cars.car_make to achieve maximum performance.
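
For example (a sketch; the index names are illustrative), the single-column indexes could be created like this, or combined into one multicolumn index for this access pattern:

-- Illustrative names; either the two single-column indexes or the combined one.
CREATE INDEX customers_cars_customer_id_idx ON customers_cars (customer_id);
CREATE INDEX customers_cars_car_make_idx    ON customers_cars (car_make);
-- or, combined:
CREATE INDEX customers_cars_make_cust_idx   ON customers_cars (car_make, customer_id);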

gmarintes
  • Yes, but I'm not sure it's any more efficient than what I have, considering I'd have to wrap all of that in another query as well. – vol7ron Nov 05 '14 at 15:34
  • You need to tell us that other query so we can consider it when making an "optimized" query. – gmarintes Nov 05 '14 at 15:38
  • Check the EXPLAIN plan of the queries and also execution time on a large DB. – gmarintes Nov 05 '14 at 15:38
  • What other query? I was just saying that including the extra count fields would not be what I consider view-ready, so it would all have to be wrapped in a `select customer_id, <+other fields> from ()` – vol7ron Nov 05 '14 at 16:16
  • You can easily move the SUM() into the HAVING clause if you don't want the extra columns. I modified the query above for your reference. If what you mean by "wrap" is that you are going to use this query as a subselect, then you need to find a way to remove the subselect; subselects are slow in the general case. – gmarintes Nov 06 '14 at 02:40
  • I forget -- isn't `count` more efficient than `sum` in this case? I'll upvote this because I think it would work. Note, I still think Erwin's answer would be more efficient, and in terms of readability/maintainability my original SQL would probably win. Thanks for being so diligent. – vol7ron Nov 13 '14 at 13:58
  • Yes, @vol7ron. I updated the query so that it uses count. It should give you a minor performance improvement. – gmarintes Nov 13 '14 at 15:06
  • You will notice the performance difference only when you have millions of records. For small databases, you won't see much difference between all the queries suggested. – gmarintes Nov 13 '14 at 15:08
  • Yeah, that's true. In my practice it's less about the number of records and more about the number of calls. Single application DBs that are not yet at the scale for distributed systems require different choices to be made regarding where certain logic is conducted; that, and resource availability causes a person to do things that are not ideal, but I digress. – vol7ron Nov 13 '14 at 15:18

You don't need to join to customers at all (given referential integrity).

Generally, this is a case of relational division. We assembled an arsenal of techniques under a related question.

Unique combinations

If (customer_id, car_make) were defined unique in customers_cars, the query would be much simpler:

SELECT customer_id
FROM   customers_cars
WHERE  car_make IN ('BMW', 'Ford')
GROUP  BY 1
HAVING count(*) = 2;
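
For reference, such a constraint could look like the following sketch (the constraint name is illustrative, and it can only be added after existing duplicate rows are removed):

-- Hypothetical constraint; requires duplicate (customer_id, car_make) rows to be removed first.
ALTER TABLE customers_cars
  ADD CONSTRAINT customers_cars_customer_id_car_make_key UNIQUE (customer_id, car_make);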

Combinations not unique

Since (customer_id, car_make) is not unique, we need an extra step.

For only a few cars, your original query is not that bad. But (especially with duplicates!) EXISTS is typically faster than IN, and we don't need the final DISTINCT:

SELECT customer_id -- no DISTINCT needed.
FROM   customers c
WHERE  EXISTS (SELECT 1 FROM customers_cars WHERE customer_id = c.customer_id AND car_make = 'BMW')
AND    EXISTS (SELECT 1 FROM customers_cars WHERE customer_id = c.customer_id AND car_make = 'Ford');

The above query gets verbose and less efficient for a longer list of cars. For an arbitrary number of cars I suggest:

SELECT customer_id
FROM  (
   SELECT customer_id, car_make
   FROM   customers_cars
   WHERE  car_make IN ('BMW', 'Ford')
   GROUP  BY 1, 2
   ) sub
GROUP  BY 1
HAVING count(*) = 2;
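
As a sketch of how this generalizes, the list can be supplied once as an array (the literal below stands in for a bound parameter), so the count in HAVING no longer has to be hard-coded:

SELECT customer_id
FROM  (
   SELECT customer_id, car_make
   FROM   customers_cars
   WHERE  car_make = ANY('{BMW,Ford}'::text[])   -- the wanted makes, passed once
   GROUP  BY 1, 2
   ) sub
GROUP  BY 1
HAVING count(*) = array_length('{BMW,Ford}'::text[], 1);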

SQL Fiddle.

Erwin Brandstetter
  • Yeah, but it is not unique. The join to customers is a vestige of the original query (there will be other fields from customers), but I probably should not have included it, to avoid confusion. It's always a toss-up between what to change and what to mention; as you've seen, I forgot to mention distinctness. – vol7ron Nov 05 '14 at 15:49
  • That seems like it would be more efficient than the *array* method Frank came up with: http://stackoverflow.com/a/26760799/183181 – vol7ron Nov 05 '14 at 16:08
  • @vol7ron: It certainly is. Array handling is relatively expensive in Postgres; pg 9.4 improves this somewhat. But no need for speculation: run tests with `EXPLAIN ANALYZE` (multiple runs to exclude caching effects) to see which one performs best -- see the sketch below. – Erwin Brandstetter Nov 05 '14 at 16:10
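
A minimal sketch of such a test, shown here with the EXISTS variant (any of the candidate queries can be substituted):

-- Compare the reported execution time across several runs; the first run may be skewed by cold caches.
EXPLAIN ANALYZE
SELECT customer_id
FROM   customers c
WHERE  EXISTS (SELECT 1 FROM customers_cars WHERE customer_id = c.customer_id AND car_make = 'BMW')
AND    EXISTS (SELECT 1 FROM customers_cars WHERE customer_id = c.customer_id AND car_make = 'Ford');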