
I have two tables, orders and line_items, with the following structure:

Orders (id = PK, index on user_id)

id   user_id
==   =======
1    1
2    2
3    1
4    3
5    1

LineItems (id = PK, indexes on order_id and product_id)

id   order_id   product_id   quantity
==   ========   ==========   ========
1    1          1            1
2    1          2            2
3    2          1            4
4    2          3            6
5    3          1            1
6    4          1            1
7    5          1            1

I am trying to find the most efficient way to solve the following requirements:

  • Given a user and a product, find the LineItems belonging to Orders where the given product is the only product ordered. E.g., if user_id is 1 and product_id is 1, the query should return line items 5 and 7.

  • Given a user and a product, find the Orders where the given product is the only product ordered. E.g., if user_id is 1 and product_id is 1, the query should return orders 3 and 5.

The Orders and LineItems table can have millions of rows.

I have a working solution that uses COUNT and HAVING. I am not certain that this is the most efficient solution.

Specifically, I am wondering if this can be addressed by using the technique outlined by Cletus in this answer.

Note: I am using the Orders and LineItems tables to describe the scenario. My actual tables are quite different and are not related to orders.

Edit 2

Is this query more efficient than using GROUP BY and HAVING?

SELECT A.id
FROM   LineItems A
JOIN   Orders B ON B.id = A.order_id AND B.user_id = 1
LEFT OUTER JOIN LineItems C ON C.order_id = A.order_id AND 
                               C.product_id != A.product_id   
WHERE  A.product_id = 1 AND C.id IS NULL
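A quick way to sanity-check the anti-join form against a GROUP BY/HAVING form is to run both on the sample data. The sketch below uses sqlite3 purely for illustration (the real engine and table names may differ); the GROUP BY variant shown here is an assumption, exploiting the fact that when COUNT(*) = 1, MIN(product_id) is the order's only product:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE Orders (id INTEGER PRIMARY KEY, user_id INTEGER)")
cur.execute("""CREATE TABLE LineItems (id INTEGER PRIMARY KEY,
    order_id INTEGER, product_id INTEGER, quantity INTEGER)""")
cur.executemany("INSERT INTO Orders VALUES (?, ?)",
                [(1, 1), (2, 2), (3, 1), (4, 3), (5, 1)])
cur.executemany("INSERT INTO LineItems VALUES (?, ?, ?, ?)",
                [(1, 1, 1, 1), (2, 1, 2, 2), (3, 2, 1, 4), (4, 2, 3, 6),
                 (5, 3, 1, 1), (6, 4, 1, 1), (7, 5, 1, 1)])

# Anti-join form: keep product-1 line items whose order has no other product.
anti_join = """
SELECT A.id
FROM   LineItems A
JOIN   Orders B ON B.id = A.order_id AND B.user_id = 1
LEFT OUTER JOIN LineItems C ON C.order_id = A.order_id
                           AND C.product_id != A.product_id
WHERE  A.product_id = 1 AND C.id IS NULL
"""

# GROUP BY / HAVING form: orders with exactly one line item, for that product.
group_by = """
SELECT MIN(A.id)
FROM   LineItems A
JOIN   Orders B ON B.id = A.order_id AND B.user_id = 1
GROUP  BY A.order_id
HAVING COUNT(*) = 1 AND MIN(A.product_id) = 1
"""

print(sorted(r[0] for r in cur.execute(anti_join)))  # [5, 7]
print(sorted(r[0] for r in cur.execute(group_by)))   # [5, 7]
```

One subtle difference to be aware of: if an order can contain the same product on two separate line items, the anti-join still treats it as "only that product", while COUNT(*) = 1 rejects it.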
– Harish Shetty

4 Answers

select o.id OrderID, MIN(i.id) LineItemID
from orders o
inner join lineitems i on i.order_id = o.id
where o.user_id= 1
group by o.id
having count(*)=1

GROUP BY, HAVING, COUNT is the most efficient approach for this type of query. It scans the required data fully, but only within the user's orders, and produces the result in that single pass.

You can kill two birds with one stone, since for orders with a single line item, min(i.id) gives you the (only) LineItemID.

Indexes you NEED to have: orders.user_id, lineitems.order_id
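Note that the query as posted filters only on user, not on product. A sanity check of it on the sample data, plus a hedged variant that also pins the product (an assumption: since COUNT(*) = 1, MIN(i.product_id) is the order's only product), using sqlite3 just for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE Orders (id INTEGER PRIMARY KEY, user_id INTEGER)")
cur.execute("""CREATE TABLE LineItems (id INTEGER PRIMARY KEY,
    order_id INTEGER, product_id INTEGER, quantity INTEGER)""")
cur.executemany("INSERT INTO Orders VALUES (?, ?)",
                [(1, 1), (2, 2), (3, 1), (4, 3), (5, 1)])
cur.executemany("INSERT INTO LineItems VALUES (?, ?, ?, ?)",
                [(1, 1, 1, 1), (2, 1, 2, 2), (3, 2, 1, 4), (4, 2, 3, 6),
                 (5, 3, 1, 1), (6, 4, 1, 1), (7, 5, 1, 1)])

# Single-line-item orders for user 1, restricted to product 1.
query = """
SELECT o.id, MIN(i.id)
FROM   Orders o
JOIN   LineItems i ON i.order_id = o.id
WHERE  o.user_id = 1
GROUP  BY o.id
HAVING COUNT(*) = 1 AND MIN(i.product_id) = 1
"""

# Each row pairs the order id with its only line item id.
print(sorted(cur.execute(query).fetchall()))  # [(3, 5), (5, 7)]
```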

– RichardTheKiwi
  • I need to access the order id and line item id separately as I am using the query to select the candidate rows for deletion. I have updated my question with a solution I am considering. Let me know what you think. – Harish Shetty Mar 31 '11 at 05:12
  • @KandadaBoggu - If you only need one of them, you can remove the other from the SELECT clause; that's what I meant. Your revised query is an alternative, there is no clear "faster" query because it always depends on the distribution of data, e.g. average lines per order. Try both and use the faster one – RichardTheKiwi Mar 31 '11 at 05:30
  • I expect 1-5 line items per order and 100s of thousands of line items for user + product combination. – Harish Shetty Mar 31 '11 at 05:36
  • @KandadaBoggu - I could be very wrong, but at an average of 3 lines per order, the queries should be roughly similar in performance. Let us know once you have tried both. – RichardTheKiwi Mar 31 '11 at 05:42
select
  o.id order_id,
  l.line_item_id
from
  (
    select
      order_id,
      min(id)         line_item_id,
      min(product_id) product_id  -- the only product, since count(*) = 1
    from
      LineItems
    group by
      order_id
    having count(*) = 1
  ) l
  inner join Orders o on l.order_id = o.id and o.user_id = 1
where l.product_id = 1
– Abhijit Gaikwad

COUNT(*) = 1 is special: you don't need to actually count to detect it. You could, for instance, use NOT EXISTS to pick the wanted tuples:

SELECT id
FROM lineitems li
WHERE NOT EXISTS (
    SELECT *
    FROM lineitems nx
    WHERE nx.order_id = li.order_id
      AND nx.id <> li.id
);

This (sub)query can be very fast (most query planners will recognise it as an ANTI-join). The grouping on order_id is still needed internally, but the counting can be omitted: the subquery can return false as soon as the first other line item for the same order_id is encountered.
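Combining the NOT EXISTS anti-join with the user and product filters from the question gives line items 5 and 7 on the sample data. A sketch, using sqlite3 only for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE Orders (id INTEGER PRIMARY KEY, user_id INTEGER)")
cur.execute("""CREATE TABLE LineItems (id INTEGER PRIMARY KEY,
    order_id INTEGER, product_id INTEGER, quantity INTEGER)""")
cur.executemany("INSERT INTO Orders VALUES (?, ?)",
                [(1, 1), (2, 2), (3, 1), (4, 3), (5, 1)])
cur.executemany("INSERT INTO LineItems VALUES (?, ?, ?, ?)",
                [(1, 1, 1, 1), (2, 1, 2, 2), (3, 2, 1, 4), (4, 2, 3, 6),
                 (5, 3, 1, 1), (6, 4, 1, 1), (7, 5, 1, 1)])

# Product-1 line items of user 1's orders that have no sibling line item.
query = """
SELECT li.id
FROM   LineItems li
JOIN   Orders o ON o.id = li.order_id AND o.user_id = 1
WHERE  li.product_id = 1
  AND  NOT EXISTS (
       SELECT 1
       FROM LineItems nx
       WHERE nx.order_id = li.order_id
         AND nx.id <> li.id)
"""

print(sorted(r[0] for r in cur.execute(query)))  # [5, 7]
```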

– wildplasser

If you have a really large project and a really large amount of data, then it would be better to precalculate the result (a "similar goods" style summary), refreshed by a scheduler (once a day, hour, week, ...) or by a trigger (after a new row is added).

It is not possible to make the queries you mentioned (using COUNT + HAVING + GROUP BY) highly performant at that scale.
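A minimal sketch of the precalculation idea, assuming a hypothetical summary table single_item_orders that a scheduled job rebuilds; the table name and columns are illustrative, and sqlite3 stands in for the real engine:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE Orders (id INTEGER PRIMARY KEY, user_id INTEGER)")
cur.execute("""CREATE TABLE LineItems (id INTEGER PRIMARY KEY,
    order_id INTEGER, product_id INTEGER, quantity INTEGER)""")
cur.executemany("INSERT INTO Orders VALUES (?, ?)",
                [(1, 1), (2, 2), (3, 1), (4, 3), (5, 1)])
cur.executemany("INSERT INTO LineItems VALUES (?, ?, ?, ?)",
                [(1, 1, 1, 1), (2, 1, 2, 2), (3, 2, 1, 4), (4, 2, 3, 6),
                 (5, 3, 1, 1), (6, 4, 1, 1), (7, 5, 1, 1)])

# Hypothetical summary table: one row per single-line-item order.
cur.execute("""CREATE TABLE single_item_orders (
    order_id     INTEGER PRIMARY KEY,
    line_item_id INTEGER,
    product_id   INTEGER,
    user_id      INTEGER)""")

def refresh_summary(cur):
    """Rebuild the summary; in production this would run from a scheduler."""
    cur.execute("DELETE FROM single_item_orders")
    cur.execute("""
        INSERT INTO single_item_orders
        SELECT li.order_id, MIN(li.id), MIN(li.product_id), MIN(o.user_id)
        FROM   LineItems li
        JOIN   Orders o ON o.id = li.order_id
        GROUP  BY li.order_id
        HAVING COUNT(*) = 1""")

refresh_summary(cur)

# The lookup is then a plain filtered read instead of a grouped scan.
rows = cur.execute("""SELECT order_id, line_item_id
                      FROM single_item_orders
                      WHERE user_id = 1 AND product_id = 1""").fetchall()
print(sorted(rows))  # [(3, 5), (5, 7)]
```

In production the summary table would want an index on (user_id, product_id) so the lookup stays cheap as the table grows.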

– zerkms
  • I am using Orders and LineItems tables to describe the scenario. My actual table is quite different and it not related to order etc. – Harish Shetty Mar 31 '11 at 02:56
  • @KandadaBoggu: so what? ;-) The main idea is: if you cannot optimize the query (or it is not possible to do; yes, there are such cases) then it can be a solution to precalculate the data. – zerkms Mar 31 '11 at 02:57
  • Specifically, I am wondering if this can be addressed by using the technique outlined by `Bill` in this answer http://stackoverflow.com/questions/477006/sql-statement-join-vs-group-by-and-having/477035#477035. – Harish Shetty Mar 31 '11 at 03:07
  • My motive here is to delete the rows after finding them. I would like to avoid maintaining a separate table(if I can). – Harish Shetty Mar 31 '11 at 03:11
  • Per the 'Bill' posting you mention, did you read the Cletus answer that has test times included? 59 votes, VS 16 tells me that for large data sets you want to learn how to use JOINs. (and to be honest, group by, having was my first thought too) – shellter Mar 31 '11 at 04:15
  • @shelter, @zerkms: I have updated my question with a solution I am considering. Let me know what you think. – Harish Shetty Mar 31 '11 at 05:13