How to select similar sets in SQL

Question

I have the following tables:

Order
----
ID (pk)

OrderItem
----
OrderID (fk -> Order.ID)
ItemID (fk -> Item.ID)
Quantity

Item
----
ID (pk)

How can I write a query that can select all Orders that are at least 85% similar to a specific Order?

I considered using the Jaccard Index statistic to calculate the similarity of two Orders. (By taking the intersection of each set of OrderItems divided by the union of each set of OrderItems)

However, I can't think of a way to do so without storing the computed Jaccard Index for each possible combination of two Orders. Is there another way?

Also, is there a way to include the difference in Quantity of each matched OrderItem into account?

Additional Info:

Total Orders: ~79k
Total OrderItems: ~1.76m
Avg. OrderItems per Order: 21.5
Total Items: ~13k

Note

The 85% similarity number is just a best guess at what the customer actually needs, it may change in the future. A solution that works for any similarity would be preferable.

I'm thinking out loud.... create an item key on each item that can be turned into an integer. Create an order key on each order that can be turned into an integer. Create a calculated order key on each order that is the sum of the item keys and the base order key. That will turn each order into an int. You'd have to carefully decide how to construct the key, picking only the meaningful parts like ignore the person's name and address etc. and make items in the same category be similar to each other. — Brian White, Nov 14 '12 at 18:04
If you want to add number of items ordered, I think you need to do a better job of deciding what 85% simliar means. For instnce if I have an order with six items and another order with the same six items it is 100% similar just on the item number. if item 1 has 1000 items in one order and 450 in the other and all the others have the same amount waht is 85% similar? 5 out of 6 are an exact match but one is way off. Do I need to match based on the number wof items that ho match, how close they are as a group or by individual items? — HLGEM, Nov 14 '12 at 20:07
@BrianWhite You're basically describing a bitmap index (or really a vectormap, I guess) I think something along those lines would be the best solution to the problem, I just don't know how to do it on sql-server. — Zachary Yates, Nov 14 '12 at 23:09
@HLGEM The 85% similarity extends to the quantity axis as well, where order A of 4 Apples and order B of 3 Apples (75%) wouldn't match, but 8 and 7 Apples respectively would match. The quantity axis is less important - I'm not worried about it too much. — Zachary Yates, Nov 14 '12 at 23:15
Do you really want orders where person a bought 5 sprockets and person b bought 6 sprockets? Or do you want to find orders where person a ordered health products and garden products and person b also bought health products and garden products? I would think the latter would have more e-commerce applications. — Brian White, Nov 15 '12 at 03:30
@BrianWhite Yes, I do want to know that. The customer isn't involved in e-commerce (although I don't fault you for the assumption, many questions are related to that). They are a machine shop. They are trying to reduce their tooling downtime. — Zachary Yates, Nov 15 '12 at 22:16

Jonathan Leffler · Accepted Answer · 2012-11-19T14:56:07.013

You specify

How can I write a query that can select all orders that are at least 85% similar to a specific order?

This is an important simplification compared with 'all pairs of orders that are at least 85% similar to each other'.

We'll use some TDQD (Test-Driven Query Design) and some analysis to help us.

Preliminaries

To be remotely similar, the two orders must have at least one item in common. This query can be used to determine which orders have at least one item in common with a specified order:

SELECT DISTINCT I1.OrderID AS ID
  FROM OrderItem AS I1
  JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
 WHERE I1.OrderID != <specified order ID>

This prunes the list of other orders to be examined quite a lot, though if the specified order included one of your most popular items, it's likely that a lot of other orders also did so.

Instead of the DISTINCT, you could use:

SELECT I1.OrderID AS ID, COUNT(*) AS Num_Common
  FROM OrderItem AS I1
  JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
 WHERE I1.OrderID != <specified order ID>
 GROUP BY I1.OrderID

This gives you the number of items in an order that it has in common with the specified order. We also need the number of items in each order:

SELECT OrderID AS ID, COUNT(*) AS Num_Total
  FROM OrderItem
 GROUP BY OrderID;

Identical Orders

For 100% similarity, the two orders would have as many items in common as each has items. This would probably not find many pairs of orders, though. We can find the orders with exactly the same items as the specified order easily enough:

SELECT L1.ID
  FROM (SELECT OrderID AS ID, COUNT(*) AS Num_Total
          FROM OrderItem
         GROUP BY OrderID
       ) AS L1
  JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS Num_Common
          FROM OrderItem AS I1
          JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
         WHERE I1.OrderID != <specified order ID>
         GROUP BY I1.OrderID
       ) AS L2 ON L1.ID = L2.ID AND L1.Num_Total = L2.Num_Common;

Edit: This turns out not to be stringent enough; for the orders to be identical, the number of items in the specified order must also be the same as the number in common:

SELECT L1.ID, L1.Num_Total, L2.ID, L2.Num_Common, L3.ID, L3.Num_Total
  FROM (SELECT OrderID AS ID, COUNT(*) AS Num_Total
          FROM OrderItem
         GROUP BY OrderID
       ) AS L1
  JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS Num_Common
          FROM OrderItem AS I1
          JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
         WHERE I1.OrderID != <specified order ID>
         GROUP BY I1.OrderID
       ) AS L2 ON L1.ID = L2.ID AND L1.Num_Total = L2.Num_Common
  JOIN (SELECT OrderID AS ID, COUNT(*) AS Num_Total
          FROM OrderItem
         WHERE OrderID = <specified order ID>
         GROUP BY OrderID
       ) AS L3 ON L2.Num_Common = L3.Num_Total;

Similar Orders — Analyzing the Formula

Applying the Jaccard Similarity as defined at Wikipedia to two orders A and B, with |A| being the count of the number of items in order A, the Jaccard Similarity J(A,B) = |A∩B| ÷ |A∪B|, where |A∩B| is the number of items in common to the two orders and |A∪B| is the total number of different items ordered.

To meet an 85% Jaccard Similarity criterion, if the number of items in either order is less than some threshold, the orders must be identical. For example, if both orders A and B have 5 items, say, but there's one item different between the two, it gives you 4 items in common (|A∩B|) and 6 items in total (|A∪B|), so the Jaccard Similarity J(A,B) is only 66⅔%.

For 85% similarity when there are N items in each of the two orders and 1 item is different, (N-1) ÷ (N+1) ≥ 0.85, which means N > 12 (12⅓ to be precise). For a fraction F = J(A,B), one item different means (N-1) ÷ (N+1) ≥ F which can be solved for N giving N ≥ (1 + F) ÷ (1 - F). As the similarity requirement goes up, the orders must be identical for increasingly large values of N.

Generalizing still further, let's suppose we have different size orders with N and M items (without loss of generality, N < M). The maximum value of |A∩B| is now N and the minimum value of |A∪B| is M (meaning all the items in the smaller order appear in the larger order). Let's define that M = N + ∆, and that there are ∂ items present in the smaller order that are not present in the larger order. It follows that there are ∆+∂ items present in the larger order that are not in the smaller order.

By definition, then, |A∩B| = N-∂, and |A∪B| = (N-∂) + ∂ + (N+∆-(N-∂)), where the three added terms represent (1) the number of items in common between the two orders, (2) the number of items only in the smaller order, and (3) the number of items only in the larger order. This simplifies to: |A∪B| = N+∆+∂.

Key Equation

For a similarity fraction F, we're interested in pairs of orders where J(A,B) ≥ F, so:

(N-∂) ÷ (N+∆+∂) ≥ F

F ≤ (N-∂) ÷ (N+∆+∂)

We can use a spreadsheet to graph the relationship between these. For a given number of items in the smaller order (x-axis), and for a given similarity, we can graph the maximum value of ∂ that gives us a similarity of F. The formula is:

∂ = (N(1-F) - F∆) ÷ (1+F)

...plot of ∂ = (N(1-F) - F∆) ÷ (1+F)...

This is a linear equation in N and ∆ for constant F; it is non-linear for different values of F. Clearly, ∂ has to be a non-negative integer.

Given F = 0.85, for orders that are the same size (∆=0), for 1 ≤ N < 13, ∂ = 0; for 13 ≤ N < 25, ∂ ≤ 1; for 25 ≤ N < 37, ∂ ≤ 2, for 37 ≤ N < 50, ∂ ≤ 3.

For orders that differ by 1 (∆=1), for 1 ≤ N < 18, ∂ = 0; for 18 ≤ N < 31, ∂ ≤ 1; for 31 ≤ N < 43, ∂ ≤ 2; etc. If ∆=6, you need N=47 before the orders are still 85% similar with ∂=1. That means the small order has 47 items, of which 46 are in common with the large order of 53 items.

Similar Orders — Applying the Analysis

So far, so good. How can we apply that theory to selecting the orders similar to a specified order?

First, we observe that the specified order could be the same size as a similar order, or larger, or smaller. This complicates things a bit.

The parameters of the equation above are:

N – number of items in smaller order
∆ — difference between number of items in larger order and N
F — fixed
∂ — number of items in smaller order not matched in larger order

The values available using minor variations on the queries developed at the top:

NC — number of items in common
NA — number of items in specified order
NB — number of items in compared order

Corresponding queries:

SELECT OrderID AS ID, COUNT(*) AS NA
  FROM OrderItem
 WHERE OrderID = <specified order ID>
 GROUP BY OrderID;

SELECT OrderID AS ID, COUNT(*) AS NB
  FROM OrderItem
 WHERE OrderID != <specified order ID>
 GROUP BY OrderID;

SELECT I1.OrderID AS ID, COUNT(*) AS NC
  FROM OrderItem AS I1
  JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
 WHERE I1.OrderID != <specified order ID>
 GROUP BY I1.OrderID

For convenience, we want the values N and N+∆ (and hence ∆) available, so we can use a UNION to arrange things appropriately, with:

NS = N — number of items in smaller order
NL = N + ∆ — number of items in larger order

and in the second version of the UNION query, with:

NC = N - ∂ — number of items in common

Both queries keep the two order ID numbers so that you can track back to the rest of the order information later.

SELECT v1.ID AS OrderID_1, v1.NA AS NS, v2.ID AS OrderID_2, v2.NB AS NL
  FROM (SELECT OrderID AS ID, COUNT(*) AS NA
          FROM OrderItem
         WHERE OrderID = <specified order ID>
         GROUP BY OrderID
       ) AS v1
  JOIN (SELECT OrderID AS ID, COUNT(*) AS NB
          FROM OrderItem
         WHERE OrderID != <specified order ID>
         GROUP BY OrderID
       ) AS v2
    ON v1.NA <= v2.NB
UNION
SELECT v2.ID AS OrderID_1, v2.NB AS NS, v1.ID AS OrderID_2, v1.NA AS NL
  FROM (SELECT OrderID AS ID, COUNT(*) AS NA
          FROM OrderItem
         WHERE OrderID = <specified order ID>
         GROUP BY OrderID
       ) AS v1
  JOIN (SELECT OrderID AS ID, COUNT(*) AS NB
          FROM OrderItem
         WHERE OrderID != <specified order ID>
         GROUP BY OrderID
       ) AS v2
    ON v1.NA > v2.NB

This gives us a table expression with columns OrderID_1, NS, OrderID_2, NL, where NS is the number of items in the 'smaller order and NL is the number of items in the larger order. Since there is no overlap in the order numbers generated by the v1 and v2 table expressions, there's no need to worry about 'reflexive' entries where the OrderID values are the same. Adding NC to this is most easily handled in the UNION query too:

SELECT v1.ID AS OrderID_1, v1.NA AS NS, v2.ID AS OrderID_2, v2.NB AS NL, v3.NC AS NC
  FROM (SELECT OrderID AS ID, COUNT(*) AS NA
          FROM OrderItem
         WHERE OrderID = <specified order ID>
         GROUP BY OrderID
       ) AS v1
  JOIN (SELECT OrderID AS ID, COUNT(*) AS NB
          FROM OrderItem
         WHERE OrderID != <specified order ID>
         GROUP BY OrderID
       ) AS v2
    ON v1.NA <= v2.NB
  JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS NC
          FROM OrderItem AS I1
          JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
         WHERE I1.OrderID != <specified order ID>
         GROUP BY I1.OrderID
       ) AS v3
    ON v3.ID = v2.ID
UNION
SELECT v2.ID AS OrderID_1, v2.NB AS NS, v1.ID AS OrderID_2, v1.NA AS NL, v3.NC AS NC
  FROM (SELECT OrderID AS ID, COUNT(*) AS NA
          FROM OrderItem
         WHERE OrderID = <specified order ID>
         GROUP BY OrderID
       ) AS v1
  JOIN (SELECT OrderID AS ID, COUNT(*) AS NB
          FROM OrderItem
         WHERE OrderID != <specified order ID>
         GROUP BY OrderID
       ) AS v2
    ON v1.NA > v2.NB
  JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS NC
          FROM OrderItem AS I1
          JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
         WHERE I1.OrderID != <specified order ID>
         GROUP BY I1.OrderID
       ) AS v3
    ON v3.ID = v1.ID

This gives us a table expression with columns OrderID_1, NS, OrderID_2, NL, NC, where NS is the number of items in the 'smaller order and NL is the number of items in the larger order, and NC is the number of items in common.

Given NS, NL, NC, we are looking for orders that satisfy:

(N-∂) ÷ (N+∆+∂) ≥ F.

N – number of items in smaller order
∆ — difference between number of items in larger order and N
F — fixed
∂ — number of items in smaller order not matched in larger order
NS = N — number of items in smaller order
NL = N + ∆ — number of items in larger order
NC = N - ∂ — number of items in common

The condition, therefore, needs to be:

NC / (NL + (NS - NC)) ≥ F

The term on the LHS must be evaluated as a floating point number, not as an integer expression. Applying that to the UNION query above, yields:

SELECT OrderID_1, NS, OrderID_2, NL, NC,
        CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) AS Similarity
  FROM (SELECT v1.ID AS OrderID_1, v1.NA AS NS, v2.ID AS OrderID_2, v2.NB AS NL, v3.NC AS NC
          FROM (SELECT OrderID AS ID, COUNT(*) AS NA
                  FROM OrderItem
                 WHERE OrderID = <specified order ID>
                 GROUP BY OrderID
               ) AS v1
          JOIN (SELECT OrderID AS ID, COUNT(*) AS NB
                  FROM OrderItem
                 WHERE OrderID != <specified order ID>
                 GROUP BY OrderID
               ) AS v2
            ON v1.NA <= v2.NB
          JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS NC
                  FROM OrderItem AS I1
                  JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
                 WHERE I1.OrderID != <specified order ID>
                 GROUP BY I1.OrderID
               ) AS v3
            ON v3.ID = v2.ID
        UNION
        SELECT v2.ID AS OrderID_1, v2.NB AS NS, v1.ID AS OrderID_2, v1.NA AS NL, v3.NC AS NC
          FROM (SELECT OrderID AS ID, COUNT(*) AS NA
                  FROM OrderItem
                 WHERE OrderID = <specified order ID>
                 GROUP BY OrderID
               ) AS v1
          JOIN (SELECT OrderID AS ID, COUNT(*) AS NB
                  FROM OrderItem
                 WHERE OrderID != <specified order ID>
                 GROUP BY OrderID
               ) AS v2
            ON v1.NA > v2.NB
          JOIN (SELECT I1.OrderID AS ID, COUNT(*) AS NC
                  FROM OrderItem AS I1
                  JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
                 WHERE I1.OrderID != <specified order ID>
                 GROUP BY I1.OrderID
               ) AS v3
            ON v3.ID = v1.ID
       ) AS u
 WHERE CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) >= 0.85 -- F

You might observe that this query only uses the OrderItem table; the Order and Item tables are not needed.

Warning: partially tested SQL (caveat lector). The SQL above now seems to produce plausible answers on minuscule data sets. I adjusted the similarity requirement (0.25, then 0.55) and got plausible values and appropriate selectivity. However, my test data had but 8 items in the biggest order, and certainly wasn't covering the full scope of the described data. Since the DBMS I use most frequently does not support CTEs, the SQL below is untested. However, I am moderately confident that unless I made a big mistake, the CTE code in version 1 (with lots of repetition of the specified order ID) should be clean. I think version 2 may be OK too, but...it is untested.

There may be more compact ways of expressing the query, possibly using the OLAP functions.

If I was going to test this, I'd create a table with a few representative sets of order items, checking that the similarity measure returned was sensible. I'd work the queries more or less as shown, gradually building up the complex query. If one of the expressions was shown to be flawed, then I'd make appropriate adjustments in that query until the flaw was fixed.

Clearly, performance will be an issue. The innermost queries are not dreadfully complex, but they aren't wholy trivial. However, measurement will show whether it's a dramatic problem or just a nuisance. Studying the query plans may help. It seems very probable that there should be an index on OrderItem.OrderID; the queries are unlikely to perform well if there isn't such an index. That is unlikely to be a problem since it is a foreign key column.

You might get some benefit out of using 'WITH clauses' (Common Table Expressions). They would make explicit the repetition that is implicit in the two halves of the UNION sub-query.

Using Common Table Expressions

Using common table expressions clarifies to the optimizer when expressions are the same, and may help it perform better. They also help the humans reading your query. The query above does rather beg for the use of CTEs.

Version 1: Repeating the specified order number

WITH SO AS (SELECT OrderID AS ID, COUNT(*) AS NA       -- Specified Order (SO)
              FROM OrderItem
             WHERE OrderID = <specified order ID>
             GROUP BY OrderID
           ),
     OO AS (SELECT OrderID AS ID, COUNT(*) AS NB       -- Other orders (OO)
              FROM OrderItem
             WHERE OrderID != <specified order ID>
             GROUP BY OrderID
           ),
     CI AS (SELECT I1.OrderID AS ID, COUNT(*) AS NC    -- Common Items (CI)
              FROM OrderItem AS I1
              JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID AND I2.OrderID = <specified order ID>
             WHERE I1.OrderID != <specified order ID>
             GROUP BY I1.OrderID
           )
SELECT OrderID_1, NS, OrderID_2, NL, NC,
        CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) AS Similarity
  FROM (SELECT v1.ID AS OrderID_1, v1.NA AS NS, v2.ID AS OrderID_2, v2.NB AS NL, v3.NC AS NC
          FROM SO AS v1
          JOIN OO AS v2 ON v1.NA <= v2.NB
          JOIN CI AS v3 ON v3.ID  = v2.ID
        UNION
        SELECT v2.ID AS OrderID_1, v2.NB AS NS, v1.ID AS OrderID_2, v1.NA AS NL, v3.NC AS NC
          FROM SO AS v1
          JOIN OO AS v2 ON v1.NA  > v2.NB
          JOIN CI AS v3 ON v3.ID  = v1.ID
       ) AS u
 WHERE CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) >= 0.85 -- F

Version 2: Avoiding repeating the specified order number

WITH SO AS (SELECT OrderID AS ID, COUNT(*) AS NA       -- Specified Order (SO)
              FROM OrderItem
             WHERE OrderID = <specified order ID>
             GROUP BY OrderID
           ),
     OO AS (SELECT OI.OrderID AS ID, COUNT(*) AS NB    -- Other orders (OO)
              FROM OrderItem AS OI
              JOIN SO ON OI.OrderID != SO.ID
             GROUP BY OI.OrderID
           ),
     CI AS (SELECT I1.OrderID AS ID, COUNT(*) AS NC    -- Common Items (CI)
              FROM OrderItem AS I1
              JOIN SO AS S1 ON I1.OrderID != S1.ID
              JOIN OrderItem AS I2 ON I2.ItemID = I1.ItemID
              JOIN SO AS S2 ON I2.OrderID  = S2.ID
             GROUP BY I1.OrderID
           )
SELECT OrderID_1, NS, OrderID_2, NL, NC,
        CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) AS Similarity
  FROM (SELECT v1.ID AS OrderID_1, v1.NA AS NS, v2.ID AS OrderID_2, v2.NB AS NL, v3.NC AS NC
          FROM SO AS v1
          JOIN OO AS v2 ON v1.NA <= v2.NB
          JOIN CI AS v3 ON v3.ID  = v2.ID
        UNION
        SELECT v2.ID AS OrderID_1, v2.NB AS NS, v1.ID AS OrderID_2, v1.NA AS NL, v3.NC AS NC
          FROM SO AS v1
          JOIN OO AS v2 ON v1.NA  > v2.NB
          JOIN CI AS v3 ON v3.ID  = v1.ID
       ) AS u
 WHERE CAST(NC AS NUMERIC) / CAST(NL + NS - NC AS NUMERIC) >= 0.85 -- F

Neither of these is an easy read; both are easier than the big SELECT with the CTEs written out in full.

Minimal test data

This is inadequate for good testing. It gives a small modicum of confidence (and it did show up the problem with the 'identical order' query.

CREATE TABLE Order (ID SERIAL NOT NULL PRIMARY KEY);
CREATE TABLE Item  (ID SERIAL NOT NULL PRIMARY KEY);
CREATE TABLE OrderItem
(
    OrderID INTEGER NOT NULL REFERENCES Order,
    ItemID INTEGER NOT NULL REFERENCES Item,
    Quantity DECIMAL(8,2) NOT NULL
);

INSERT INTO Order VALUES(1);
INSERT INTO Order VALUES(2);
INSERT INTO Order VALUES(3);
INSERT INTO Order VALUES(4);
INSERT INTO Order VALUES(5);
INSERT INTO Order VALUES(6);
INSERT INTO Order VALUES(7);

INSERT INTO Item VALUES(111);
INSERT INTO Item VALUES(222);
INSERT INTO Item VALUES(333);
INSERT INTO Item VALUES(444);
INSERT INTO Item VALUES(555);
INSERT INTO Item VALUES(666);
INSERT INTO Item VALUES(777);
INSERT INTO Item VALUES(888);
INSERT INTO Item VALUES(999);

INSERT INTO OrderItem VALUES(1, 111, 1);
INSERT INTO OrderItem VALUES(1, 222, 1);
INSERT INTO OrderItem VALUES(1, 333, 1);
INSERT INTO OrderItem VALUES(1, 555, 1);

INSERT INTO OrderItem VALUES(2, 111, 1);
INSERT INTO OrderItem VALUES(2, 222, 1);
INSERT INTO OrderItem VALUES(2, 333, 1);
INSERT INTO OrderItem VALUES(2, 555, 1);

INSERT INTO OrderItem VALUES(3, 111, 1);
INSERT INTO OrderItem VALUES(3, 222, 1);
INSERT INTO OrderItem VALUES(3, 333, 1);
INSERT INTO OrderItem VALUES(3, 444, 1);
INSERT INTO OrderItem VALUES(3, 555, 1);
INSERT INTO OrderItem VALUES(3, 666, 1);

INSERT INTO OrderItem VALUES(4, 111, 1);
INSERT INTO OrderItem VALUES(4, 222, 1);
INSERT INTO OrderItem VALUES(4, 333, 1);
INSERT INTO OrderItem VALUES(4, 444, 1);
INSERT INTO OrderItem VALUES(4, 555, 1);
INSERT INTO OrderItem VALUES(4, 777, 1);

INSERT INTO OrderItem VALUES(5, 111, 1);
INSERT INTO OrderItem VALUES(5, 222, 1);
INSERT INTO OrderItem VALUES(5, 333, 1);
INSERT INTO OrderItem VALUES(5, 444, 1);
INSERT INTO OrderItem VALUES(5, 555, 1);
INSERT INTO OrderItem VALUES(5, 777, 1);
INSERT INTO OrderItem VALUES(5, 999, 1);

INSERT INTO OrderItem VALUES(6, 111, 1);
INSERT INTO OrderItem VALUES(6, 222, 1);
INSERT INTO OrderItem VALUES(6, 333, 1);
INSERT INTO OrderItem VALUES(6, 444, 1);
INSERT INTO OrderItem VALUES(6, 555, 1);
INSERT INTO OrderItem VALUES(6, 777, 1);
INSERT INTO OrderItem VALUES(6, 888, 1);
INSERT INTO OrderItem VALUES(6, 999, 1);

INSERT INTO OrderItem VALUES(7, 111, 1);
INSERT INTO OrderItem VALUES(7, 222, 1);
INSERT INTO OrderItem VALUES(7, 333, 1);
INSERT INTO OrderItem VALUES(7, 444, 1);
INSERT INTO OrderItem VALUES(7, 555, 1);
INSERT INTO OrderItem VALUES(7, 777, 1);
INSERT INTO OrderItem VALUES(7, 888, 1);
INSERT INTO OrderItem VALUES(7, 999, 1);
INSERT INTO OrderItem VALUES(7, 666, 1);

Holy crap. Not only is this answer insanely useful, but you have the same phonetic last name as Ashley Judd's character in one of my favorite TNG episodes. http://en.memory-alpha.org/wiki/Robin_Lefler — Zachary Yates, Nov 19 '12 at 01:10
Just to clarify, all of the references to the `ID` column of the `OrderItem` table mean `OrderID`, right? — Zachary Yates, Nov 19 '12 at 06:45
@ZacharyYates: Yes, correct. Any direct reference to ID from OrderItem would be OrderID; it is probably a hangover from when I had a reference to the main Order table in the queries. That 'caveat lector' warning was not in there for no purpose at all...sadly. I see I have plain ID in the select-list of the 'Common Items' sub-query; that should be `OrderID AS ID`. Oops! — Jonathan Leffler, Nov 19 '12 at 06:56
I've gone through the answer and 'corrected' it — I think. But I'm sitting at the wrong computer for testing it. — Jonathan Leffler, Nov 19 '12 at 07:03
Looks like just 1 typo in the last code block, but I updated it. — Zachary Yates, Nov 19 '12 at 07:11
@ZacharyYates: Thanks; yes, it was a typo. I've chased it back through the prior-to-last statements where it also occurred; copy'n'paste is powerful until you make a mistake. — Jonathan Leffler, Nov 19 '12 at 07:19
is it efficient for big data sets considering the joins I doubt that — tourist, Oct 09 '15 at 07:27
@tourist: I don't believe I made any comments about the code being efficient. Indeed, one paragraph starts _'Clearly, performance will be an issue'._ — Jonathan Leffler, Oct 09 '15 at 07:54
This is likely the best answer I have seen on stack overflow... what a brilliant post! — Bjorg P, Aug 25 '21 at 15:51

Will Hartung · Answer 2 · 2012-10-30T17:11:15.167

There's really no easy answer to this. You can certainly store the Jaccard index (actually I'd just store the ones that meet the criteria, and throw out the rest), but the real problem is calculating it (effectively have to scan all of your existing order each time a new order was entered in to the system to calculate the new index).

That can be quite expensive depending on your volume of orders that's you're maintaining. Maybe you only compare it to the last year of orders, or something.

If you're doing it on the fly, it gets more interesting, but still expensive.

You can readily get a list of all orders that have the same product items. One list per item. This, in fact, is not necessarily a lot of data (if you have a lot of orders for a single popular item, then it can be a long list). The individual queries aren't particularly insane either (again depending on your data). If you have a vast vast amount of data, the query can be readily map/reduced and even work with sharded data stores. Bitmap indexes (if your DB support this) are particularly good for getting lists like this quite quickly.

Then you can simply count the times that an order number occurs in all of the lists, and then drop those off that don't meet the threshold. That's a straight forward merge operation.

But you'd have to do this calculation every single time you'd want the information, since you can't really store it.

So, it really does boil down to what you need the information for, how often you need it, your items <-> order distribution, how long you can wait for it, etc.

Addenda:

Thinking about it a little more, this is a simple query, but it may take some time to run. Likely not much with modern hardware, you don't really have that much data. For a single screen viewing an order you wouldn't notice it. If you were running report across all orders, then you would definitely notice it -- and would need a different approach.

Lets consider an order with 20 line items.

And you want an 85% match. That means orders that have 17 or more items in common.

Here is a query that will give you the orders you're interested in:

SELECT orderId, count(*) FROM OrderItem
WHERE itemId in ('list', 'of', 'items', 'in', 'order', 123, 456, 789)
GROUP BY orderId
HAVING count(*) >= 17

So, this gives you a collection of all the line items with the same items as your order. Then you simply sum them up by orderId, and those that are equal to or greater than your threshold (17 in this case), are the candidate orders.

Now, you don't say how many items you have in your catalog. If you have 1000 items, perfectly distributed, this query will chew on 1600 rows of data -- which is no big deal. With proper indexes this should go quite quickly. However, if you have items that are "really popular", then you're going to chew through a lot more rows of data.

But, again, you don't have that much data. Most of this query can be done within the indexes on a proper database and not even hit the actual tables. So, as I said, you'll likely not notice the impact of this query on a interactive system.

So, give it a try and see how it goes for you.

Thank you for the excellent description of the problem space. I'll get a few statistics and update the original question. — Zachary Yates, Oct 30 '12 at 05:51
You should probably use `COUNT(DISTINCT itemId)` to protect against duplicates that cause you to get a false positive. — Bill Karwin, Nov 13 '12 at 00:22
And eventually you will want to make this dynamic SQl to adjust for the individual order you are comparing the others against. — HLGEM, Nov 14 '12 at 20:09
I believe that finding all orders which have at least 85% identical orders as another order is a very simple and fast task. — alzaimar, Nov 17 '12 at 07:54

JRideout · Answer 3 · 2012-11-14T23:21:34.587

This approach takes into account the Quantity using the Extended Jaccard Coefficient or Tanimoto Similarity. It computes the similarity across all Orders, by using a the vector of common ItemIDs of magnitude Quantity. It does cost a table scan, but doesn't require an N^2 computation of all possible similarities.

SELECT
    OrderID,
    SUM(v1.Quantity * v2.Quantity) /
    (SUM(v1.Quantity * v1.Quantity) +
     SUM(v2.Quantity * v2.Quantity) -
     SUM(v1.Quantity * v2.Quantity) ) AS coef
FROM
    OrderItem v1 FULL OUTER JOIN OrderItem v2
    ON v1.ItemID = v2.ItemID
    AND v2.OrderID = ?
GROUP BY OrderID
HAVING coef > 0.85;

Formula for the Extended Jaccard Coefficient:

Tanimoto Similarity

The `HAVING coef > 0.85` was the only part that doesn't work verbatim, you have to use the entire `HAVING SUM(v1.Quantity * v2.Quantity /...` expression. (Tested in SqlServer 2008 R2) — Zachary Yates, Nov 19 '12 at 21:42

Yellowfog · Answer 4 · 2012-11-15T09:05:14.067

This isn't really an answer so much as an extended comment. I'll remove it if it's deemed not to make sense.

If you're trying to find the 'similar' items on the fly, then the problem is that you have a lot (~79k) of Orders to look at. So if you're trying to do this, then you need ways to cut down the number of Orders you're considering before you do the expensive set comparison.

One way, pointed out by @Will, is to take into account the number of items in the Orders. So if your target order has 20 items then you only need to consider Orders with 17-23 OrderItems (or something like that, depending upon the exact calculation for '85% similarity'). I'm assuming that these numbers can be calculated via a trigger, whenever an Order is created or changed, and stored in columns on the Order table.

But if you can store the size of the set then you could also store other numbers. For instance, you could store the number of odd OrderItem primary key values in each Order. Then the Orders that you're considering would have to be appropriately close to having that number of odd order numbers (I might do the maths at some point to fill out 'appropriately close').

And if you think of partitioning the values by 'odd' numbers as splitting into size-1 stripes, you can easily partition by differently sized stripes using the modulus operator. Eg. ItemID%4<2 will made size-2 stripes. You can then record for each Order the number of OrderItem primary keys within those stripes. Your candidate Orders will have to be appropriately close to your target Order on each way of partitioning the values.

So what you'll end up with is a big subquery that tries to limit the size of the candidates in the Orders table by looking at a whole bunch of metrics that are stored - and indexed - on that table.

digscoop · Answer 5 · 2012-11-15T09:52:39.580

I would try something like this for speed, listing orders by similarity to Order @OrderId. The joined INTS is supposed to be the intersection and the similarity value is my attempt to calculate the Jaccard index.

I am not using the quantity field at all here, but i think it could be done too without slowing the query down too much if we figure out a way to quantify similarity that includes quantity. Below, I am counting any identical item in two orders as a similarity. You could join on quantity as well, or use a measure where a match that includes quantity counts double. I don't know if that is reasonable.

SELECT 
    OI.OrderId,
    1.0*COUNT(INTS.ItemId) / 
    (COUNT(*)
    + (SELECT COUNT(*) FROM OrderItem WHERE OrderID = @OrderId) 
    - COUNT(INTS.ItemId)) AS Similarity
FROM    
    OrderItem OI        
JOIN
    OrderItem INTS ON INTS.ItemID = OI.ItemID AND INTS.OrderId=@OrderId
GROUP BY 
    OI.OrderId
HAVING  
    1.0*COUNT(INTS.ItemId) / 
    (COUNT(*)
    + (SELECT COUNT(*) FROM OrderItem WHERE OrderID = @OrderId) 
    - COUNT(INTS.ItemId)) > 0.85
ORDER BY
    Similarity DESC

It also presupposes that OrderId/ItemId combinations are unique in OrderItem. I realize this might not be the case, and it could be worked around using a view.

I'm sure there are better ways, but one way to weigh in quantify difference be to replace the nominator COUNT(INTS.ItemId) with something like this (supposing all quantities to be positive) that decreases the hit slowly towards 0 when the quantities differ.

    1/(ABS(LOG(OI.quantity)-LOG(INTS.quantity))+1)

Added: This more readable solution using the Tanimoto Similarity suggested by JRideout

DECLARE 
    @ItemCount INT,
    @OrderId int 
SELECT     
    @OrderId  = 1
SELECT     
    @ItemCount = COUNT(*)
FROM 
    OrderItem
WHERE 
    OrderID = @OrderId 


SELECT 
    OI.OrderId,
    SUM(1.0* OI.Quantity*INTS.Quantity/(OI.Quantity*OI.Quantity+INTS.Quantity*INTS.Quantity-OI.Quantity*INTS.Quantity )) /
    (COUNT(*) + @ItemCount - COUNT(INTS.ItemId)) AS Similarity
FROM    
    OrderItem OI        
LEFT JOIN
    OrderItem INTS ON INTS.ItemID = OI.ItemID AND INTS.OrderId=@OrderId
GROUP BY 
    OI.OrderId
HAVING      
    SUM(1.0* OI.Quantity*INTS.Quantity/(OI.Quantity*OI.Quantity+INTS.Quantity*INTS.Quantity-OI.Quantity*INTS.Quantity )) /
    (COUNT(*) + @ItemCount - COUNT(INTS.ItemId)) > 0.85
ORDER BY
    Similarity DESC

And if there are doubles or quantity 0 somewhere, try going through a view like SELECT OrderId, ItemId, Sum(Quantity) AS Quantity FROM OrderItem OI GROUP BY OrderId, ItemId HAVING Sum(Quantity)>0 — digscoop, Nov 14 '12 at 19:50
Both of your assumptions below the code block are true, OrderItem.Quantity > 0, and OrderItem.ItemID + OrderItem.OrderID is the PK. — Zachary Yates, Nov 14 '12 at 23:18
Then this should work. The log formula might be a little naive though. — digscoop, Nov 15 '12 at 08:33
Just replace the intersection count COUNT(INTS.ItemId) with the SUM of the weighing function you like best based on OI.Quantity and INTS.Quantity — digscoop, Nov 15 '12 at 08:58
Added a more readable solution using the Tanimoto Similarity suggested by JRideout. I runs nicely on a 2 million rows OrderItem table. — digscoop, Nov 18 '12 at 18:49

score 1 · Answer 6 · answered Nov 14 '12 at 22:01

Hmm, funny, I am currently working on something similar. Why don't you just join the sample order (i.e. their items) with all other orders (their items) and list all orders which have at least 85% matches by grouping the count of matches per order?

-- let @SampleorderID be the ID of a sample

declare @totalOrders int, @ThresholdOrderCount int
select @totalOrders = count(*) from OrderItems where orderID=@SampleOrderID

set @ThresholdOrderCount = 85*@totalOrders/100 -- 85% of the item amount of the sample

-- Now match the contents of the sample order with the contents of all other orders
-- count the #matches and show only those orders with at least 85% identical items
Select AllOrder.OrderID, count(*)
from   OrderItems sample 
       join OrderItems AllOrder 
         on sample.ItemID = AllOrder.ItemID
where sample.OrderID = @SampleOrderID 
  and sample.OrderID<>AllOrder.OrderID
group by AllOrder.OrderID 
having count(*)>@ThresholdOrderCount

This should work. However, it also returns orders, which contain more items than the sample. If that is ok, then the above query should also be quite fast.

Steve Ford · Answer 7 · 2012-11-16T12:45:47.800

The approach that I would take would be to first of all find all orderitems that are 85% similar to the orderitems for the selected order, then count the number of these orderitems for each order and check whether the number of items is 85% similar to the selected order, using the following query:

DECLARE @OrderId int = 2
SET @OrderId = 6

/*
Retrieve orderitems that match 85% with required @orderId
*/
;WITH SelectedOrderItemCount
AS
(
    SELECT COUNT(*) * 0.85 AS LowerBoundary, COUNT(*) * 1.15 AS UpperBoundary
    FROM OrderItem
    WHERE OrderId = @OrderId
)
SELECT OtherOrders.OrderId, COUNT(*) as NumberOfOrderItems
FROM OrderItem SpecificOrder
INNER JOIN OrderItem OtherOrders
    ON OtherOrders.ItemId = SpecificOrder.ItemId
WHERE SpecificOrder.OrderId = @OrderId AND
      OtherOrders.OrderId <> @OrderId AND
    OtherOrders.Quantity BETWEEN SpecificOrder.Quantity * 0.85 AND SpecificOrder.Quantity * 1.15
GROUP BY OtherOrders.OrderId
HAVING COUNT(*) BETWEEN (SELECT LowerBoundary FROM SelectedOrderItemCount) 
                    AND (SELECT UpperBoundary FROM SelectedOrderItemCount)

Full SQLFiddle demo here

score 0 · Answer 8 · answered Nov 18 '12 at 22:17

0

This is one kind of data mining problem. So instead of using sql you can use Apriori algorithm with 85% support. The implementation of this algorithm is freely available in many tools.

answered Nov 18 '12 at 22:17

Ahmed Tanvir

187
7