A database has collections of products; each collected product has a price recorded at the moment of being added to a collection, with a few other values.
// `collections_products`
id collection_id group product_id option_id price
1 1 0 56 0 3.1920
2 1 0 56 54 1.2000
3 1 0 56 55 2.4000
4 1 0 56 56 3.6000
5 1 0 56 57 4.8000
6 1 0 56 58 6.0000
7 1 0 57 0 3.1920
8 1 0 57 54 1.2000
11 10 0 56 0 3.1920
12 10 0 56 54 1.2000
13 10 0 56 55 2.4000
14 10 0 56 56 3.6000
15 10 0 56 57 4.8000
16 10 0 56 58 6.0000
17 10 0 57 0 3.1920
18 10 0 57 54 1.2000
21 100 0 56 0 9.9999
22 100 0 56 54 9.9999
23 100 0 56 55 9.9999
24 100 0 56 56 9.9999
25 100 0 56 57 9.9999
26 100 0 56 58 9.9999
27 100 0 57 0 9.9999
28 100 0 57 54 9.9999
31 1000 0 56 0 3.1920
32 1000 0 56 54 1.2000
33 1000 0 56 55 2.4000
34 1000 0 56 56 3.6000
36 1000 0 56 58 6.0000
37 1000 0 57 0 3.1920
38 1000 0 57 54 1.2000
Having some collection_id
, I need to find other identical, duplicate collections (having identical content, i.e. same products, groups and options at same prices; order not important) to a given one.
In the examples above:
- the set of rows with
collection_id
10 (set B) is a duplicate of the set of rows withcollection_id
1 (set A); for every row in A there is another row in B with an identicalgroup product_id option_id price
, and A and B have the same number of rows - the set of rows with
collection_id
100 is NOT a duplicate of any other because all the prices are different - the set of rows with
collection_id
1000 is NOT a duplicate of any other because the count of rows are different (row id 35 is missing compared tocollection_id
1)
Came up with:
- Have a select query looking for other collections based on what common IDs and values they have, all in one SQL statement, but unsure if this is at all possible with MySQL
- Calculate a checksum of each collection's result set (group, product_id, option_id, price of each row, together), store it as
collections.checksum
, re-calculate each time there's movement inside a collection. When searching, get the checksum of the collection I have and select by that checksum.
Researched the checksum idea. Found:
- MySQL rows checksum & mySQL: get hash value for each row?: checksums the individual rows, but not a result set
- Checksum of SELECT results in MySQL: uses CRC32 and has an Expected collisions warning, which looks reasonable
Don't want to reinvent the wheel. Surprised I can't find anything reusable, unless I'm looking in a wrong direction.
What would be the right way to approach this? Please advice
UPDATE I'm not looking to delete any collections, even if they're duplicates. I need to combine them instead. This is a half-made-up example, sorry if it doesn't make 100% sense