0

A database has collections of products; each collected product has a price recorded at the moment of being added to a collection, with a few other values.

// `collections_products`

id collection_id group product_id option_id price
1  1             0     56         0         3.1920
2  1             0     56         54        1.2000
3  1             0     56         55        2.4000
4  1             0     56         56        3.6000
5  1             0     56         57        4.8000
6  1             0     56         58        6.0000
7  1             0     57         0         3.1920
8  1             0     57         54        1.2000

11  10           0     56         0         3.1920
12  10           0     56         54        1.2000
13  10           0     56         55        2.4000
14  10           0     56         56        3.6000
15  10           0     56         57        4.8000
16  10           0     56         58        6.0000
17  10           0     57         0         3.1920
18  10           0     57         54        1.2000

21  100          0     56         0         9.9999
22  100          0     56         54        9.9999
23  100          0     56         55        9.9999
24  100          0     56         56        9.9999
25  100          0     56         57        9.9999
26  100          0     56         58        9.9999
27  100          0     57         0         9.9999
28  100          0     57         54        9.9999


31  1000         0     56         0         3.1920
32  1000         0     56         54        1.2000
33  1000         0     56         55        2.4000
34  1000         0     56         56        3.6000

36  1000         0     56         58        6.0000
37  1000         0     57         0         3.1920
38  1000         0     57         54        1.2000

Having some collection_id, I need to find other identical, duplicate collections (having identical content, i.e. same products, groups and options at same prices; order not important) to a given one.

In the examples above:

  • the set of rows with collection_id 10 (set B) is a duplicate of the set of rows with collection_id 1 (set A); for every row in A there is another row in B with an identical group product_id option_id price, and A and B have the same number of rows
  • the set of rows with collection_id 100 is NOT a duplicate of any other because all the prices are different
  • the set of rows with collection_id 1000 is NOT a duplicate of any other because the count of rows are different (row id 35 is missing compared to collection_id 1)

Came up with:

  • Have a select query looking for other collections based on what common IDs and values they have, all in one SQL statement, but unsure if this is at all possible with MySQL
  • Calculate a checksum of each collection's result set (group, product_id, option_id, price of each row, together), store it as collections.checksum, re-calculate each time there's movement inside a collection. When searching, get the checksum of the collection I have and select by that checksum.

Researched the checksum idea. Found:

Don't want to reinvent the wheel. Surprised I can't find anything reusable, unless I'm looking in a wrong direction.

What would be the right way to approach this? Please advice

UPDATE I'm not looking to delete any collections, even if they're duplicates. I need to combine them instead. This is a half-made-up example, sorry if it doesn't make 100% sense

Caius Jard
  • 72,509
  • 5
  • 49
  • 80
ᴍᴇʜᴏᴠ
  • 4,804
  • 4
  • 44
  • 57

1 Answers1

1

Something like this should work:

SELECT `product_id`, `option_id`, `group`, `price`, COUNT(*) as count_occurrences 
FROM `collections_products`
GROUP BY `product_id`, `option_id`, `group`, `price`
HAVING count_occurrences > 1;

This will give you all (product_id, option_id, price) combinations that occur more than once in your dataset. If you also want IDs of relevant rows, you can do a subquery with JOIN like this:

SELECT cp.`id` FROM
(SELECT `product_id`, `option_id`, `group`, `price`, COUNT(*) as count_occurrences 
FROM `collections_products`
GROUP BY `product_id`, `option_id`, `group`, `price`
HAVING count_occurrences > 1) t1
LEFT JOIN `collections_products` cp
ON t1.`product_id` = cp.`product_id` 
AND t1.`option_id` = cp.`option_id` 
AND t1.`group` = cp.`group`
AND t1.`price` = cp.`price`;

UPD:

To get the collection IDs that contain the same products as given collection, you'll need something like this:

SELECT DISTINCT t2.`collection_id` FROM
(SELECT `collection_id`,`product_id`, `option_id`, `group`, `price`
FROM `collections_products`
WHERE `collection_id`=?) t1
LEFT JOIN `collections_products` t2
ON t1.`product_id`=t2.`product_id`
AND t1.`option_id`=t2.`option_id`
AND t1.`group`=t2.`group`
AND t1.`price`=t2.`price`
AND t1.`collection_id`<>t2.`collection_id`;
ᴍᴇʜᴏᴠ
  • 4,804
  • 4
  • 44
  • 57
Sergey Kudriavtsev
  • 10,328
  • 4
  • 43
  • 68
  • Thanks for your input. Sorry, I missed something in my question: I need to search for duplicate collections based on a given `collection_id`. Edited it just now. – ᴍᴇʜᴏᴠ Sep 03 '19 at 09:28
  • @aexl I'm not sure I completely understand your question then. The first thing that comes into my mind would be adding `WHERE colledtion_id=?` clause to the query after `FROM collections_products` line in both cases. This will give you all rows with the same product, option, group and price within a specific collection. Is this what you need? – Sergey Kudriavtsev Sep 03 '19 at 09:33
  • I need to find other collections having the same content as a given collection. Content = same products, options, groups and prices. – ᴍᴇʜᴏᴠ Sep 03 '19 at 09:36