The problem is simple, and I have found several answers on how to proceed, but I need more specific help because of the size of the problem. Here is the situation:
- I have several (let's say 20) collections of C++ objects (all of the same type)
- Each collection contains hundreds of millions of entries
- The same entry can be present in more than one of the 20 collections
- Each collection is made up of a few thousand files, each around 4 GB. Each collection is around 50 TB, and the total size of all the collections is around 1 PB
- CPU resources available: a few thousand nodes (each with 2 GB of RAM and a reasonably new CPU). All of them can run asynchronously, accessing the files of the collections one by one
- Disk resources available: I cannot save a full second copy of all the collections (I don't have another PB of disk available), but I can reduce the size of each entry by keeping only the relevant information. The final reduced size of all the collections would be less than 100 TB, and that's OK.
What I would like to do is merge the 20 collections into a single collection containing all the entries, with all the duplicates removed. The total number of entries is around 5 billion, and a few percent of them (let's say around 3-5%) are duplicates.
Another important point is that the total size of the 20 original collections is more than 1 PB, so processing the full set of collections is a really heavy task.
Finally, at the end of the merging (i.e. when all the duplicates have been removed), the final collection has to be processed several times, so the output of the merging will be used as input to further processing steps.
Here is an example:
Collection1
------------------------------------------
| | n1 | n2 | n3 | value1...
------------------------------------------
entry0: | 23 | 11 | 34 | ....
entry1: | 43 | 12 | 24 | ....
entry2: | 71 | 51 | 91 | ....
...
Collection2
------------------------------------------
| | n1 | n2 | n3 | value1...
------------------------------------------
entry0: | 71 | 51 | 91 | ....
entry1: | 73 | 81 | 23 | ....
entry2: | 53 | 22 | 84 | ....
...
As you can see, three integers (n1, n2 and n3) are used to distinguish each entry, and entry2 in Collection1 has the same three integers as entry0 in Collection2. The latter is a duplicate of the former. Merging these two collections would give a single collection with 5 entries (having removed the duplicate entry0 of Collection2).
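To make the duplicate criterion concrete, here is a minimal C++ sketch of the result I am after, applied to the toy example above. The names EntryKey, EntryKeyHash and mergeUnique are hypothetical, and the 32-bit width of n1, n2 and n3 is an assumption. It is only an in-memory illustration, not something that could run on the real data: under that assumption, 5 billion keys of 12 bytes each would already take about 60 GB, far more than the 2 GB of RAM on a node.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_set>
#include <vector>

// Hypothetical identifier of an entry: the three integers n1, n2, n3.
// The 32-bit width is an assumption; the real type may differ.
struct EntryKey {
    std::uint32_t n1, n2, n3;
    bool operator==(const EntryKey& o) const {
        return n1 == o.n1 && n2 == o.n2 && n3 == o.n3;
    }
};

// Simple hash combiner over the three integers (illustrative, not tuned).
struct EntryKeyHash {
    std::size_t operator()(const EntryKey& k) const {
        std::size_t h = std::hash<std::uint32_t>{}(k.n1);
        h = h * 1000003u ^ std::hash<std::uint32_t>{}(k.n2);
        h = h * 1000003u ^ std::hash<std::uint32_t>{}(k.n3);
        return h;
    }
};

// Merge several collections, keeping only the first occurrence of each key.
std::vector<EntryKey> mergeUnique(
        const std::vector<std::vector<EntryKey>>& collections) {
    std::unordered_set<EntryKey, EntryKeyHash> seen;
    std::vector<EntryKey> merged;
    for (const auto& collection : collections) {
        for (const auto& entry : collection) {
            // insert() reports whether the key was new; duplicates are skipped.
            if (seen.insert(entry).second) {
                merged.push_back(entry);
            }
        }
    }
    return merged;
}
```

Calling mergeUnique on the three entries of Collection1 and the three entries of Collection2 shown above gives 5 entries, with (71, 51, 91) kept only once.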
The collections are not sorted, and each collection is made up of thousands of files (typical file size is 4 GB, and a single collection is tens of TB).
Any suggestions on the best approach?
Thanks for helping