
The problem is simple and I found several answers on how to proceed, but I need more specific help because of the size of the problem. Here is the situation:

  • I have several (let's say 20) collections of C++ objects (all of the same type)
  • Each collection contains hundreds of millions of entries
  • The same entry can be present in more than one of the 20 collections
  • Each collection is made up of a few thousand files, each one around 4GB. Each collection is around 50TB and the total size of all collections is around 1PB
  • CPU resources available: a few thousand nodes (each one with 2GB RAM and a reasonably recent CPU). All of them can run asynchronously, accessing the files of the collections one by one
  • Disk resources available: I cannot save a full second copy of all collections (I don't have another PB of disk available), but I can reduce the size of each entry, keeping only the relevant information. The final reduced size of all collections would be less than 100TB and that's OK.

What I would like to do is merge the 20 collections into a single collection containing all the entries, with all duplicates removed. The total number of entries is around 5 billion and a few percent of them are duplicates (let's say around 3-5%).

Another important point is that the total size (all 20 original collections) is more than 1PB, so processing the full set of collections is really a heavy task.

Finally: at the end of the merging (i.e. when all the duplicates have been removed) the final collection has to be processed several times... so the output of the merging will be used as input to further processing steps.

Here is an example:

Collection1
------------------------------------------
|        | n1 | n2 | n3 | value1...
------------------------------------------
entry0:  | 23 | 11 | 34 | ....  
entry1:  | 43 | 12 | 24 | ....  
entry2:  | 71 | 51 | 91 | ....  
...

Collection2
------------------------------------------
|        | n1 | n2 | n3 | value1...
------------------------------------------
entry0:  | 71 | 51 | 91 | ....  
entry1:  | 73 | 81 | 23 | ....  
entry2:  | 53 | 22 | 84 | ....  
...

As you can see, there are 3 integers that are used to distinguish each entry (n1, n2 and n3), and entry2 in Collection1 has the same 3 integers as entry0 in Collection2. The latter is a duplicate of the former. Merging these 2 collections would give a single collection with 5 entries (having removed the duplicated entry0 of Collection2).

The collections are not sorted, and each collection is made up of thousands of files (typical file size 4GB; a single collection is tens of TB).
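(To make the identification scheme concrete, here is a minimal sketch of how the (n1, n2, n3) triple could be packed into a single 64-bit key, assuming the ranges given in the comments below, n1 up to 10^10, n2 up to 1000, n3 up to 10000; the type and the names are illustrative only.)

```cpp
#include <cstdint>

// Hypothetical key type for deduplication: the three identifying integers
// packed into one 64-bit value. With n1 < 1e10 (~34 bits), n2 < 1000
// (10 bits) and n3 < 10000 (14 bits), a 40/10/14-bit layout is lossless.
struct EntryKey {
    std::uint64_t packed;

    static EntryKey make(std::uint64_t n1, std::uint32_t n2, std::uint32_t n3) {
        return EntryKey{ (n1 << 24) | (std::uint64_t(n2) << 14) | n3 };
    }

    friend bool operator==(EntryKey a, EntryKey b) { return a.packed == b.packed; }
    friend bool operator<(EntryKey a, EntryKey b)  { return a.packed <  b.packed; }
};
```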

Any suggestions on the best approach?

Thanks for helping

Attilio
  • So basically you want to create a set? – Jaffa Jun 06 '19 at 10:01
  • you need to ask a more specific question to get a more specific answer. Can you show a [mcve]? When you say "collection" what do you mean exactly? Are you already using a std container? Is the question what container to pick? – 463035818_is_not_an_ai Jun 06 '19 at 10:02
  • so, these 20 collections are in your main memory in containers (like vectors) already or just lie in files? If they are in memory, how much GB is occupied already and how much do you have? Please also provide information on whether these collections are already sorted in some way – IceFire Jun 06 '19 at 10:06
  • In the past (when memory was much more limited) this was usually done as follows: Read a sufficient part of data into memory, sort this data in memory (e.g. with quick sort or something even more modern), and write data into a (new) file. Finally, when all data has been sorted that way, merge all files to one. While merging, it's easy to detect and discard duplicates. – Scheff's Cat Jun 06 '19 at 10:27
  • I added more information as requested... I hope that helps to clarify the problem – Attilio Jun 06 '19 at 10:29
  • @Attilio Did you read the link about creating a [mcve]? – Ted Lyngmo Jun 06 '19 at 10:31
  • @Scheff What you suggest makes sense, but I'm not sure I will have the resources for the 2 intermediate steps (first when I write the sorted entries and then when I merge the sorted files) – Attilio Jun 06 '19 at 10:34
  • I assumed that it will be even harder to provide a sufficient amount of main memory (aka RAM) than to provide a sufficient amount of free disk space for 1 PB. ;-) – Scheff's Cat Jun 06 '19 at 10:36
  • @TedLyngmo of course... but the point here is not to write down a few lines of code on sorting... there are hundreds of working examples I can use. My point is: given the size of the problem (1000TB of data on disk and 5 billion entries), I know I have to test each procedure I'm going to try to get at least a couple of numbers: an estimation of the time needed and an estimation of the disk needed. Is there anyone around who can give me some advice on where to start and which procedure/algorithm could be more efficient? – Attilio Jun 06 '19 at 10:40
  • Maybe you can sort the data in the files "in place" (e.g. using memory-mapped files) to avoid the need for extra space. Concerning in-place sorting, I found [SO: Sorting in place](https://stackoverflow.com/questions/16585507/sorting-in-place). – Scheff's Cat Jun 06 '19 at 10:43
  • @Scheff and, strange as it may seem... RAM is probably not the major problem... I could use thousands of machines and split the sorting into as many steps as I like... but then where do I save the intermediate results? – Attilio Jun 06 '19 at 10:44
  • @Attilio Having seen the same approach in the past, I foresee a good generic answer coming soon - and a response "_but that doesn't really fit because ..._" ... – Ted Lyngmo Jun 06 '19 at 10:45
  • @TedLyngmo I know it's not easy to deal with simple problems when you scale to huge dimensions... honestly I knew it was just an attempt... but you never know... and I'm going to read Scheff's suggestion... maybe "Sorting in place" could give me some new ideas... – Attilio Jun 06 '19 at 10:52
  • OK, we're speaking about [SETI@home](https://en.wikipedia.org/wiki/SETI@home)... ;-) Well, if parts of the data are sorted at K computers, afterwards the final output can be merged by reading from the K computers concurrently. For a merge step, you only need K data items (the head of each separate sorted set). If the K computers aren't usable at the same time... then, of course, it's necessary to store the intermediate (sorted) data sets somewhere. – Scheff's Cat Jun 06 '19 at 10:52
  • In your example, 3 integers are used for identifying each entry. How many distinct entries can you have? Or, what is the possible range of each integer? – RobertBaron Jun 06 '19 at 11:45
  • https://stackoverflow.com/questions/55740183/what-are-some-viable-strategies-to-detecting-duplicates-in-a-large-json-file-whe/55762539#55762539 – Matt Timmermans Jun 06 '19 at 12:10
  • @RobertBaron As far as the ranges are concerned, just one of them has a big range: 0->10^10; another one is 0->1000 and the third one is 0->10000 – Attilio Jun 07 '19 at 18:38
  • Algorithms for doing what you want to do are all well known. Your problem is more one of "how can I use my available resources to solve my problem?" The question does not say anything about resources, so people assume one typical computer. In a comment, you mentioned that you can use 1000's of computers. Well, this is certainly relevant information to add to your question. How many of these computers can you use? What other resources do you have? Etc. – RobertBaron Jun 07 '19 at 18:51
  • @RobertBaron I edited the question adding more details... I don't know if you are familiar with it but it's a GRID problem: [wlcg](http://wlcg.web.cern.ch/) – Attilio Jun 07 '19 at 19:09
  • Where are the files of the collections stored? All at one node? Distributed over several nodes? Also, what sort of network bandwidth do we have between nodes? – RobertBaron Jun 07 '19 at 19:18
  • @RobertBaron files are scattered all around the world but the network is not a problem... all the sites where files are stored are connected with at least a 10Gbps connection. I would say that you can consider all the files reachable as if they were stored locally... normally processes run on nodes at the same sites where the collection is stored. – Attilio Jun 07 '19 at 22:24
  • Given the speed of the network, I assume that the files are to stay wherever they are. Is this correct? – RobertBaron Jun 07 '19 at 22:34

2 Answers


I hope your objects can be ordered (o1 <= o2 <= ... <= oN)? Load one collection into memory and sort it. Save it to disk. Get the next collection and sort it. Merge the two sorted collections on disk and delete the first one. Get the next collection ...
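A rough sketch of the merge step this answer relies on, assuming each input has already been reduced to fixed-size records sorted by a packed 64-bit key; the record layout and function names are invented for illustration, and it assumes no duplicates within a single input:

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative reduced record: a packed (n1, n2, n3) key plus whatever
// "relevant information" is kept after the size reduction.
struct Entry {
    std::uint64_t key;
    std::uint64_t payload;
};

static bool read_one(std::FILE* f, Entry& e) {
    return std::fread(&e, sizeof(Entry), 1, f) == 1;
}

// Merge two key-sorted files into `out`, keeping one record per key.
// On a key collision the record from `a` wins (an arbitrary choice).
void merge_dedup(std::FILE* a, std::FILE* b, std::FILE* out) {
    Entry ea, eb;
    bool ha = read_one(a, ea);
    bool hb = read_one(b, eb);
    while (ha || hb) {
        Entry next;
        if (ha && (!hb || ea.key <= eb.key)) {
            next = ea;
            if (hb && eb.key == ea.key)      // same key in both inputs:
                hb = read_one(b, eb);        // skip the copy from `b`
            ha = read_one(a, ea);
        } else {
            next = eb;
            hb = read_one(b, eb);
        }
        std::fwrite(&next, sizeof(Entry), 1, out);
    }
}
```

The answer's loop would then apply this pairwise: sort the next collection, merge it into the running result, delete the inputs, and repeat.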

Simo Simov
  • With about `1PB` and `say 20 collections`, `Load one collection in the memory` may strain main memory (address space, even). Even 50 TB of SLC sounds costly as of 2019. – greybeard Jun 06 '19 at 11:53
  • 1. You don't need to have more than one of the collections in memory. Also, you may merge-sort it on disk. 2. You may use a 64-bit application and address space won't be a problem. – Simo Simov Jun 06 '19 at 12:18
  • *One* collection out of 20 for a total of 1PB is bound to be about 50 TB. `64-bit application and address space won't be a problem` not logical address space, but does your processor use serial address signals? If not, how many address lines does it have/use? – greybeard Jun 06 '19 at 12:27
  • (Intel's *8th and 9th Generation Intel® Core™ Processor Families and Intel® Xeon®E Processor FamilyDatasheet*: `addressing above 512 GB is not supported`) – greybeard Jun 06 '19 at 14:41
  • I can't load one collection into memory... it's too big... my atomic unit is a single file... around 4GB, and all the virtual nodes I could use (several thousand of them actually) have the same characteristics: 2GB of RAM. – Attilio Jun 07 '19 at 18:43
  • @Attilio You don't need to load an entire collection into memory to merge-sort it. Load as many entries as you can and sort them. Save the result on disk with a unique file name in a given directory. Continue until there are no more entries. Then you merge all the files 2 by 2 and continue until only 1 file remains. That's the way the Linux sort command works. – Simo Simov Jun 10 '19 at 15:16
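A minimal sketch of the run-generation step described in the comment above: read as many fixed-size records as fit in a memory budget, sort them by key, and write one sorted run per batch. The Entry layout, the raw-binary file format and the budget handling are assumptions for illustration:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Illustrative reduced record: packed (n1, n2, n3) key plus kept payload.
struct Entry {
    std::uint64_t key;
    std::uint64_t payload;
};

// Read fixed-size Entry records from `in` until the memory budget is hit,
// sort them by key, and write one sorted run file per batch.
// Returns the names of the run files written (error handling kept minimal).
std::vector<std::string> make_sorted_runs(std::FILE* in,
                                          const std::string& run_prefix,
                                          std::size_t budget_bytes)
{
    std::vector<std::string> runs;
    const std::size_t batch = budget_bytes / sizeof(Entry);
    std::vector<Entry> buf(batch);

    for (int run_id = 0; ; ++run_id) {
        std::size_t n = std::fread(buf.data(), sizeof(Entry), batch, in);
        if (n == 0) break;
        std::sort(buf.begin(), buf.begin() + n,
                  [](const Entry& a, const Entry& b) { return a.key < b.key; });

        std::string name = run_prefix + std::to_string(run_id) + ".run";
        std::FILE* out = std::fopen(name.c_str(), "wb");
        if (!out) break;  // sketch only: real code needs proper error handling
        std::fwrite(buf.data(), sizeof(Entry), n, out);
        std::fclose(out);
        runs.push_back(name);
    }
    return runs;
}
```

The resulting run files can then be merged 2 by 2 (or k-way) with duplicate removal, as in the merge sketch above.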

Given the speed of your network and the number of available nodes, here is one way you could proceed.

You have a total of about 5G entries in 20 collections, so, on average, 250M entries per collection. Duplicate entries between collections are on the order of 3-5% (roughly 7-12M entries per collection). Now, because you have 20 collections scattered over thousands of nodes, each collection is most likely scattered over multiple nodes.

Here are the general steps of what you could do.

  1. For each one of your collections, create a database on a chosen node, where you will store all the entry Id's of the collection. That database will be on the order of a few GBs.

  2. On each node, run a process that scans all files at the node and adds the entry Id's into the collection database.

  3. On a single node, run a process that reads from all collection databases and finds duplicates. When a duplicate is found in two collections, remove the entry Id from one of the two collections.

  4. Run a process on each node to delete, from the files at the node, all entries whose Id's are not in their collection database.

In the end, all duplicates have been eliminated, and you also get 20 databases with the Id's of all the entries in each collection.
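One possible shape for step 3, assuming each collection database from steps 1-2 can be exported as a file of packed 64-bit entry Ids sorted in ascending order (the file format and the "first copy wins" rule are assumptions, not part of the answer): a 20-way merge that emits, for every duplicated Id, the collections that should drop their copy, which is exactly the input step 4 needs.

```cpp
#include <cstdint>
#include <cstdio>
#include <queue>
#include <vector>

struct Head {
    std::uint64_t id;   // packed (n1, n2, n3)
    int collection;     // index of the collection database this Id came from
};

struct ByIdDesc {       // makes std::priority_queue a min-heap on `id`
    bool operator()(const Head& a, const Head& b) const { return a.id > b.id; }
};

static bool next_id(std::FILE* f, std::uint64_t& id) {
    return std::fread(&id, sizeof(id), 1, f) == 1;
}

// k-way merge over the sorted Id files of all collections. The first copy
// of every Id is kept; every further copy is written to `drop_list` as
// "<id> <collection>" so that step 4 can delete it at the owning node.
void find_duplicates(const std::vector<std::FILE*>& dbs, std::FILE* drop_list) {
    std::priority_queue<Head, std::vector<Head>, ByIdDesc> heads;
    for (int c = 0; c < static_cast<int>(dbs.size()); ++c) {
        std::uint64_t id;
        if (next_id(dbs[c], id)) heads.push({id, c});
    }

    bool have_prev = false;
    std::uint64_t prev = 0;
    while (!heads.empty()) {
        Head h = heads.top();
        heads.pop();
        if (have_prev && h.id == prev) {
            std::fprintf(drop_list, "%llu %d\n",
                         static_cast<unsigned long long>(h.id), h.collection);
        } else {
            prev = h.id;
            have_prev = true;
        }
        std::uint64_t id;
        if (next_id(dbs[h.collection], id)) heads.push({id, h.collection});
    }
}
```

Only one Id per collection is held in memory at a time, so this fits easily on a 2GB node; the cost is a single sequential pass over roughly 40GB of Ids (5G Ids at 8 bytes each).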

RobertBaron