Selecting identical items based on shared foreign IDs they have

Question

A database has collections of products; each collected product has a price recorded at the moment of being added to a collection, with a few other values.

// `collections_products`

id collection_id group product_id option_id price
1  1             0     56         0         3.1920
2  1             0     56         54        1.2000
3  1             0     56         55        2.4000
4  1             0     56         56        3.6000
5  1             0     56         57        4.8000
6  1             0     56         58        6.0000
7  1             0     57         0         3.1920
8  1             0     57         54        1.2000

11  10           0     56         0         3.1920
12  10           0     56         54        1.2000
13  10           0     56         55        2.4000
14  10           0     56         56        3.6000
15  10           0     56         57        4.8000
16  10           0     56         58        6.0000
17  10           0     57         0         3.1920
18  10           0     57         54        1.2000

21  100          0     56         0         9.9999
22  100          0     56         54        9.9999
23  100          0     56         55        9.9999
24  100          0     56         56        9.9999
25  100          0     56         57        9.9999
26  100          0     56         58        9.9999
27  100          0     57         0         9.9999
28  100          0     57         54        9.9999


31  1000         0     56         0         3.1920
32  1000         0     56         54        1.2000
33  1000         0     56         55        2.4000
34  1000         0     56         56        3.6000

36  1000         0     56         58        6.0000
37  1000         0     57         0         3.1920
38  1000         0     57         54        1.2000

Given a collection_id, I need to find other collections that are duplicates of it (identical content, i.e. the same products, groups and options at the same prices; order is not important).

In the examples above:

the set of rows with collection_id 10 (set B) is a duplicate of the set of rows with collection_id 1 (set A): for every row in A there is a row in B with an identical (group, product_id, option_id, price), and A and B have the same number of rows
the set of rows with collection_id 100 is NOT a duplicate of any other because all the prices are different
the set of rows with collection_id 1000 is NOT a duplicate of any other because the row counts differ (row id 35 is missing compared to collection_id 1)

I came up with two ideas:

A single SELECT query that finds other collections based on the IDs and values they have in common, all in one SQL statement; I'm unsure whether this is possible at all with MySQL.
Calculate a checksum over each collection's result set (the group, product_id, option_id and price of every row, taken together), store it as collections.checksum, and recalculate it whenever the contents of a collection change. When searching, take the checksum of the collection I have and select by that checksum.

Researched the checksum idea. Found:

MySQL rows checksum & mySQL: get hash value for each row?: checksums the individual rows, but not a result set
Checksum of SELECT results in MySQL: uses CRC32 and has an Expected collisions warning, which looks reasonable
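
For illustration, the checksum idea could be sketched roughly as below. This is only a sketch under assumptions: the column alias `fingerprint`, the separators and the choice of SHA-256 are not from the question, the four columns are assumed to be NOT NULL (CONCAT_WS skips NULL arguments), and the session limit for GROUP_CONCAT is raised because its default of 1024 bytes would silently truncate larger collections.

```sql
-- Assumption: GROUP_CONCAT output must not be truncated; the default
-- group_concat_max_len is 1024 bytes, so raise it for this session.
SET SESSION group_concat_max_len = 1000000;

-- One fingerprint per collection: serialize every row, sort the rows so that
-- their order does not matter, join them into one string, and hash it.
SELECT `collection_id`,
       SHA2(
         GROUP_CONCAT(
           CONCAT_WS(':', `group`, `product_id`, `option_id`, `price`)
           ORDER BY `group`, `product_id`, `option_id`, `price`
           SEPARATOR '|'
         ),
         256
       ) AS fingerprint
FROM `collections_products`
GROUP BY `collection_id`;
```

Collections with equal fingerprints are duplicate candidates. On the sample data, collections 1 and 10 serialize to the same string and therefore get the same fingerprint, while 100 and 1000 serialize to different strings. Storing the value would still require recalculating it on every change to a collection, as described above.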

Don't want to reinvent the wheel. Surprised I can't find anything reusable, unless I'm looking in a wrong direction.

What would be the right way to approach this? Please advise.

UPDATE I'm not looking to delete any collections, even if they're duplicates. I need to combine them instead. This is a half-made-up example, sorry if it doesn't make 100% sense

You mean match an entire collection with another and see if both have the exact same rows? — Salman A, Sep 03 '19 at 09:14
@aexl Check this answer: https://stackoverflow.com/a/57712189/2469308 — Madhur Bhaiya, Sep 03 '19 at 09:21
Surely you're looking for rows that have the same `(group, product_id, option_id, price)` and a different id, and you're wanting to keep only the highest id (most recent)? — Caius Jard, Sep 03 '19 at 09:22
@CaiusJard No, I'm not looking to delete duplicates, but actually combine them. Edited the question to (hopefully) make that more obvious. Sorry about the confusion — ᴍᴇʜᴏᴠ, Sep 03 '19 at 09:32
What does COMBINE mean? add their prices together? average their prices? — Caius Jard, Sep 03 '19 at 09:36
@CaiusJard It does not matter (I mean it will complicate the question). I need to know their `collection_id`s. That's it. — ᴍᴇʜᴏᴠ, Sep 03 '19 at 09:38
So you're looking for rows that have the same (group, product_id, option_id, price) and a different collection_id? (But all the rows in a collection operate in totality) — Caius Jard, Sep 03 '19 at 09:39
I have some `collection_id`. I'm looking for different `collection_id`(s), if any. I don't need the rows themselves, as long as I'm sure those rows have identical `product_id`, `option_id`, `group`, `price` as the reference collection. — ᴍᴇʜᴏᴠ, Sep 03 '19 at 09:42
May I edit your question to add some detail that I think will help? — Caius Jard, Sep 03 '19 at 09:43
A fresh set of eyes always helps, thanks. Will I be able to correct if there's a misunderstanding? — ᴍᴇʜᴏᴠ, Sep 03 '19 at 09:44
@CaiusJard sure, thank you! (will delete this comment later to avoid cluttering) — ᴍᴇʜᴏᴠ, Sep 03 '19 at 09:49

score 1 · Accepted Answer · edited Sep 03 '19 at 13:55

Something like this should work:

SELECT `product_id`, `option_id`, `group`, `price`, COUNT(*) as count_occurrences 
FROM `collections_products`
GROUP BY `product_id`, `option_id`, `group`, `price`
HAVING count_occurrences > 1;

This will give you all (product_id, option_id, group, price) combinations that occur more than once in your dataset. If you also want the IDs of the relevant rows, you can join the subquery back to the table like this:

SELECT cp.`id` FROM
(SELECT `product_id`, `option_id`, `group`, `price`, COUNT(*) as count_occurrences 
FROM `collections_products`
GROUP BY `product_id`, `option_id`, `group`, `price`
HAVING count_occurrences > 1) t1
LEFT JOIN `collections_products` cp
ON t1.`product_id` = cp.`product_id` 
AND t1.`option_id` = cp.`option_id` 
AND t1.`group` = cp.`group`
AND t1.`price` = cp.`price`;

UPD:

To get the IDs of other collections that share at least one identical row (same product, option, group and price) with a given collection, you'll need something like this:

SELECT DISTINCT t2.`collection_id` FROM
(SELECT `collection_id`,`product_id`, `option_id`, `group`, `price`
FROM `collections_products`
WHERE `collection_id`=?) t1
JOIN `collections_products` t2
ON t1.`product_id`=t2.`product_id`
AND t1.`option_id`=t2.`option_id`
AND t1.`group`=t2.`group`
AND t1.`price`=t2.`price`
AND t1.`collection_id`<>t2.`collection_id`;
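
Note that the query above matches on individual rows, so it also returns collections that share only some rows with the given one (collection 1000 in the question's sample data). If whole collections must match, a stricter variant might look like the sketch below. It is not tested against the real schema and makes assumptions: `@ref_id` is a hypothetical user variable holding the reference collection_id, the aliases `other` and `ref` are illustrative, and a collection is assumed never to contain the same (group, product_id, option_id, price) row twice.

```sql
-- Hypothetical variable: the collection we want to find duplicates of.
SET @ref_id = 1;

-- Pair every row of every other collection with its identical row
-- in the reference collection, if there is one.
SELECT other.`collection_id`
FROM `collections_products` AS other
LEFT JOIN `collections_products` AS ref
       ON ref.`collection_id` = @ref_id
      AND ref.`group`      = other.`group`
      AND ref.`product_id` = other.`product_id`
      AND ref.`option_id`  = other.`option_id`
      AND ref.`price`      = other.`price`
WHERE other.`collection_id` <> @ref_id
GROUP BY other.`collection_id`
-- Every row of the other collection found a match ...
HAVING COUNT(*) = COUNT(ref.`id`)
-- ... and the reference collection has no rows left over.
   AND COUNT(ref.`id`) = (SELECT COUNT(*)
                          FROM `collections_products`
                          WHERE `collection_id` = @ref_id);
```

With the sample data and `@ref_id = 1`, only collection 10 passes both conditions: collection 100 has no matching rows at all, and collection 1000 matches on 7 rows while the reference collection has 8.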

Thanks for your input. Sorry, I missed something in my question: I need to search for duplicate collections based on a given `collection_id`. Edited it just now. — ᴍᴇʜᴏᴠ, Sep 03 '19 at 09:28
@aexl I'm not sure I completely understand your question then. The first thing that comes to mind would be adding a `WHERE collection_id=?` clause to the query after the `FROM collections_products` line in both cases. This will give you all rows with the same product, option, group and price within a specific collection. Is this what you need? — Sergey Kudriavtsev, Sep 03 '19 at 09:33
I need to find other collections having the same content as a given collection. Content = same products, options, groups and prices. — ᴍᴇʜᴏᴠ, Sep 03 '19 at 09:36
