I have a huge file of ~800M rows (about 60 GB). Rows can be duplicated, and each row consists of an id and a value. For example:
id1 valueA
id1 valueB
id2 valueA
id3 valueC
id3 valueA
id3 valueC
Note: in the real file the ids are not ordered or grouped as in the example.
I want to aggregate the rows by id, like this:
id1 valueA,valueB
id2 valueA
id3 valueC,valueA
There are 5000 possible values.
The file doesn't fit in memory, so I can't use plain Java collections. Also, most of the ids have a single value (like id2 in the example), and those lines should be written directly to the output file.
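Just to be concrete, this is the naive in-memory aggregation I would write if the file were small (a minimal sketch, with placeholder file names); with ~800M rows the map blows the heap:

```java
import java.io.*;
import java.util.*;

public class NaiveAggregate {
    public static void main(String[] args) throws IOException {
        // Collect id -> set of values for the whole file (deduplicates repeated rows).
        // With ~800M rows this map does not fit in memory.
        Map<String, Set<String>> byId = new LinkedHashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader("input.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                int sep = line.indexOf(' ');
                String id = line.substring(0, sep);
                String value = line.substring(sep + 1);
                byId.computeIfAbsent(id, k -> new LinkedHashSet<>()).add(value);
            }
        }
        // Write one "id value1,value2,..." line per id.
        try (PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("output.txt")))) {
            for (Map.Entry<String, Set<String>> e : byId.entrySet()) {
                out.println(e.getKey() + " " + String.join(",", e.getValue()));
            }
        }
    }
}
```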
For this reason, my first solution was to iterate over the file twice:
- In the first iteration I build two structures containing only ids (no values):
  - ids with a single value (S1)
  - ids with multiple values (S2)
- In the second iteration, after discarding S1 from memory, I can write the single-value id-value pairs directly to the output file by checking that their id is not in S2 (a sketch of this idea follows the list).
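A minimal sketch of that two-pass idea (file names are placeholders; it assumes the id sets of pass 1 and the multi-value aggregation of pass 2 fit in memory):

```java
import java.io.*;
import java.util.*;

public class TwoPassAggregate {
    public static void main(String[] args) throws IOException {
        // Pass 1: classify every id as single-value (S1) or multi-value (S2).
        Set<String> s1 = new HashSet<>();
        Set<String> s2 = new HashSet<>();
        try (BufferedReader in = new BufferedReader(new FileReader("input.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String id = line.substring(0, line.indexOf(' '));
                if (!s2.contains(id) && !s1.add(id)) {
                    s1.remove(id);   // seen before: promote the id from S1 to S2
                    s2.add(id);
                }
            }
        }
        s1 = null; // discard S1; only S2 is needed in pass 2

        // Pass 2: single-value ids go straight to the output,
        // ids in S2 are aggregated in memory and flushed at the end.
        Map<String, Set<String>> multi = new LinkedHashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader("input.txt"));
             PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("output.txt")))) {
            String line;
            while ((line = in.readLine()) != null) {
                int sep = line.indexOf(' ');
                String id = line.substring(0, sep);
                String value = line.substring(sep + 1);
                if (s2.contains(id)) {
                    multi.computeIfAbsent(id, k -> new LinkedHashSet<>()).add(value);
                } else {
                    out.println(id + " " + value); // single-value id: write directly
                }
            }
            for (Map.Entry<String, Set<String>> e : multi.entrySet()) {
                out.println(e.getKey() + " " + String.join(",", e.getValue()));
            }
        }
    }
}
```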
The problem is that I cannot even finish the first iteration due to memory limits.
I know that the problem could be tackled in several ways (key-value store, MapReduce, external sort).
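For example, a MapReduce-style variant in plain Java could hash-partition the file by id into N chunk files, so that all rows of a given id land in the same chunk and each chunk is small enough to aggregate in memory. A rough, untested sketch (the file names and the partition count are placeholders I picked for illustration):

```java
import java.io.*;
import java.util.*;

public class HashPartitionAggregate {
    public static void main(String[] args) throws IOException {
        int partitions = 64; // placeholder: chosen so each chunk fits in memory

        // Phase 1: split the big file into chunks by hash of the id.
        PrintWriter[] chunks = new PrintWriter[partitions];
        for (int i = 0; i < partitions; i++) {
            chunks[i] = new PrintWriter(new BufferedWriter(new FileWriter("chunk_" + i + ".txt")));
        }
        try (BufferedReader in = new BufferedReader(new FileReader("input.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String id = line.substring(0, line.indexOf(' '));
                int p = Math.floorMod(id.hashCode(), partitions);
                chunks[p].println(line);
            }
        }
        for (PrintWriter w : chunks) w.close();

        // Phase 2: aggregate each chunk independently with an in-memory map.
        try (PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("output.txt")))) {
            for (int i = 0; i < partitions; i++) {
                Map<String, Set<String>> byId = new LinkedHashMap<>();
                try (BufferedReader in = new BufferedReader(new FileReader("chunk_" + i + ".txt"))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        int sep = line.indexOf(' ');
                        byId.computeIfAbsent(line.substring(0, sep), k -> new LinkedHashSet<>())
                            .add(line.substring(sep + 1));
                    }
                }
                for (Map.Entry<String, Set<String>> e : byId.entrySet()) {
                    out.println(e.getKey() + " " + String.join(",", e.getValue()));
                }
            }
        }
    }
}
```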
My question is: which method would be the most suitable and fastest to implement? It is a one-off process, and I'd prefer to use plain Java (not an external sort tool).