6

I have a huge file of ~800M rows (60 GB). Rows can be duplicated and consist of an id and a value. For example:

id1   valueA
id1   valueB 
id2   valueA 
id3   valueC
id3   valueA
id3   valueC

note: ids are not ordered or grouped as in the example.

I want to aggregate rows by key, like this:

id1   valueA,valueB
id2   valueA
id3   valueC,valueA

There are 5000 possible values.

The file doesn't fit in memory, so I can't use simple Java collections. Also, most ids appear on a single line (like id2 in the example), and those lines should be written directly to the output file.

For this reason my first solution was to iterate over the file twice:

  • In the first iteration I store two structures containing only ids, no values (a rough sketch of this pass is shown below):
    • single-value ids (S1)
    • multiple-value ids (S2)
  • In the second iteration, after discarding the single-value ids (S1) from memory, I can write the single id-value pairs directly to the output file, checking that each id is not in the multiple-value set (S2)
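
A rough sketch of that first pass (assuming whitespace-separated lines and plain HashSets; fileIn, charset and the variable names are just illustrative) would be:

Set<String> singleIds = new HashSet<>();   // S1: ids seen on exactly one line so far
Set<String> multiIds = new HashSet<>();    // S2: ids seen on more than one line
try (BufferedReader reader = Files.newBufferedReader(fileIn, charset)) {
    String line;
    while ((line = reader.readLine()) != null) {
        String id = line.split("\\s+", 2)[0];
        if (multiIds.contains(id)) {
            continue;                      // already known to be a multi-value id
        }
        if (!singleIds.add(id)) {          // seen before: move it from S1 to S2
            singleIds.remove(id);
            multiIds.add(id);
        }
    }
}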

The problem is that I cannot even finish the first iteration because of memory limits.

I know that the problem could be faced in several ways (key-value store, map reduce, external sort).

My question is: which approach would be the best fit and the fastest to implement? It is a one-off process and I would prefer to use plain Java (not an external sort utility).

miccia4
  • 184
  • 10
  • 2
    Is the file ordered (first id1, then id2, then id3 and so on)? If not, are they at least grouped together? Are there always empty rows between different ids? – tkausl Jan 21 '16 at 06:43
  • Thanks for the questions, effectively the example was wrong. I modified the question. – miccia4 Jan 21 '16 at 06:45
  • Can you specify the file type? Is it a txt file or a properties file? – janani Jan 21 '16 at 06:54
  • 2
    How is this offtopic? It gives a precise problem, a possible approach and the issues with this approach. I'm voting to reopen. – Sergei Tachenov Jan 21 '16 at 07:01
  • 2
    1. how many unique IDs (approximately) are there in a file? 2. is this a one time job (that requires just a solution) or periodic job (that requires effective solution) ? – AdamSkywalker Jan 21 '16 at 08:14
  • Simplest way is to use any DBMS – Ken Bekov Jan 21 '16 at 08:31
  • So what exactly do you want to get? As I understand it, you need to transform one huge file into another huge file? Or do you need to collect some kind of statistics? – Ken Bekov Jan 21 '16 at 08:36

3 Answers

3

As already said (that's quick!), merge sort is one approach. Concretely, sort locally by id, say, every 1 million lines. Then save the locally sorted lines into smaller files. And then repeatedly merge the smaller, sorted files in pairs into one big sorted file. You can do the aggregation while you merge up the smaller files.

The intuition is that, when you merge up 2 sorted lists, you maintain 2 pointers, one for each list, and sort as you go. You don't need to load the complete lists. This allows you to buffer-in big files and buffer-out the merged results immediately.

Here is sample code to sort in-memory and output to a file:

// `comparator` is assumed to order lines by id (the text before the first whitespace)
private void sortAndSave(List<String> lines, Path fileOut) throws IOException {
    Collections.sort(lines, comparator);
    Files.write(fileOut, lines);
}

Here is sample code to sort locally and save the results into smaller files:

// Sort once we collect 1000000 lines
final int cutoff = 1000000;
final List<String> lines = new ArrayList<>();
int fileCount = 0;
try (BufferedReader reader = Files.newBufferedReader(fileIn, charset)) {
    String line = reader.readLine();
    while (line != null) {
        lines.add(line);
        if (lines.size() > cutoff) {
            fileCount++;
            sortAndSave(lines, Paths.get("fileOut" + fileCount));
            lines.clear();
        }
        line = reader.readLine();
    }
    if (lines.size() > 0) {
        fileCount++;
        sortAndSave(lines, Paths.get("fileOut" + fileCount));
    }
}

Here is sample code to merge sort 2 files:

try (BufferedReader reader1 = Files.newBufferedReader(file1, charset);
     BufferedReader reader2 = Files.newBufferedReader(file2, charset);
     BufferedWriter writer = Files.newBufferedWriter(fileOut, charset)) {
    String line1 = reader1.readLine();
    String line2 = reader2.readLine();
    while (line1 != null && line2 != null) {
        // Write the smaller line first so the output stays sorted
        if (comparator.compare(line1, line2) <= 0) {
            writer.write(line1);
            writer.newLine();
            line1 = reader1.readLine();
        } else {
            writer.write(line2);
            writer.newLine();
            line2 = reader2.readLine();
        }
    }
    // Drain whichever file still has lines left
    while (line1 != null) {
        writer.write(line1);
        writer.newLine();
        line1 = reader1.readLine();
    }
    while (line2 != null) {
        writer.write(line2);
        writer.newLine();
        line2 = reader2.readLine();
    }
}
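
Since the aggregation can be done while merging, here is a rough sketch of that final step, assuming the fully merged file is already sorted by id (fileSorted and fileAggregated are illustrative names; a LinkedHashSet deduplicates values to match the expected output):

try (BufferedReader reader = Files.newBufferedReader(fileSorted, charset);
     BufferedWriter writer = Files.newBufferedWriter(fileAggregated, charset)) {
    String currentId = null;
    Set<String> values = new LinkedHashSet<>();
    String line;
    while ((line = reader.readLine()) != null) {
        String[] parts = line.split("\\s+", 2);
        if (!parts[0].equals(currentId)) {
            if (currentId != null) {               // flush the previous id
                writer.write(currentId + "\t" + String.join(",", values));
                writer.newLine();
            }
            currentId = parts[0];
            values.clear();
        }
        values.add(parts[1].trim());               // duplicates are dropped by the set
    }
    if (currentId != null) {                       // flush the last id
        writer.write(currentId + "\t" + String.join(",", values));
        writer.newLine();
    }
}
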
neurite
  • 2,798
  • 20
  • 32
  • I didn't vote down, but I still don't get: you propose to split into 1M files and then you talk about merging two files. But you'll end up with more than two files, so you need to use K-way merge. – Sergei Tachenov Jan 21 '16 at 09:18
  • 2
    You can merge 2 files at a time, 3 files at a time, doesn't matter. For example, if we split into 60 files, one loop of 2-file merging will reduce it to 30 files, the 2nd loop, 15 files, ..., till you are left with one file. You can do k-way merge, 2-way merge is a special case. Either way, the number of merging loops is bounded at lg(N). – neurite Jan 21 '16 at 09:27
  • I am unhappy about the vote-down because I read the question carefully, gave a solution matching the requirements, provided the rationale, and went head over heels writing code for it. For the person who cast the down-vote, did you even read the code? – neurite Jan 21 '16 at 09:33
  • Random unexplained downvotes do happen here, happened to me several times. I doubt they'll go back to explain so better just ignore it. I've upvoted now that you've explained, if that is of any help. About merging, wouldn't it be better to just perform one k-way merge using a heap? Less I/O this way. – Sergei Tachenov Jan 21 '16 at 10:10
  • Appreciate the vote-up. For this particular question, one k-way merge is reasonable. However if we generalize, the problem with k-way merge is that k is unknown and can be large. In the original question, the file has 800 million rows. If we don't optimize by sorting locally in-memory first, if we do strictly merge-sort from the bottom, it will maintain 800 million file handles and keep track of 800 million keys. On the other hand, if we merge in pairs, we know the complexity is reduced from O(k) to O(1). True, there is more I/O. But we are looking at N vs 2N, still O(N). – neurite Jan 21 '16 at 10:28
  • 1
    How do you get O(1) here? Two-way merge is O(n/k) where n/k is the size of one "small" file. We do it log k times, but k is decreasing, so it's O(n/k) + O(2n/k) + O(4n/k) + ... + O(n) = O(n log k). K-way merge is also O(n log k). And, of course, I still mean to sort those 1M-files in-memory first and then k-way merge them. – Sergei Tachenov Jan 21 '16 at 12:35
  • Sorry, that was my down vote. I pressed it accidentally (viewed from a smartphone), and it seemed to me that I had canceled it. Obviously not. I'll upvote your other answer (multithreading). Your solution is really not bad. Sorry again. – Ken Bekov Jan 21 '16 at 12:51
  • It's the space complexity. K-way merge, in the worst case, will require O(K) space, where K == N at the beginning. Whereas 2-way merge, is O(1) space complexity. – neurite Jan 21 '16 at 17:42
  • @KenBekov, no worries. Was very stressful last night and wandered off to stackoverflow to "take a break". Guess I was still in high-octane mode. Now I'm cool. – neurite Jan 21 '16 at 17:50
  • Thanks for the idea. I implemented exactly this and it worked well. – miccia4 Jan 29 '16 at 17:39
1

when dealing with such a large amount of data, we need to think outside the box - and buffer the entire thing

First: how does this already work?

let's say I have a 4 GB video and I'm trying to load it into my video player. My player would basically need to perform 2 main operations:

buffering - 'splitting' the video into chunks and reading one chunk at a time (the buffer)

streaming - displaying the result (video) in my software (the player)

why? because it would be impossible to load everything into memory at once (and we don't even really need to... at any given moment the user observes only a portion of the video from the buffer, which is a portion of the entire file)

Second: how can this help us?

we can do the same thing for large files:

  1. split the main file into smaller files (each file contains X rows, where X is the 'buffer' size)
  2. load each one into Java and group it (a sketch of this step is shown at the end of this answer)
  3. save the result to a new file

after this process we have many small files that contain information like this

id1   valueA,valueB
id2   valueA
id3   valueC,valueA

so each grouped file contains fewer rows than the original small file it was derived from

  1. we can now merge them back and try to load them into Java and re-group everything
  2. if the process fails (still too big) we can merge the small grouped files into several larger grouped files (and repeat the process)
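
A minimal sketch of the grouping step (2) could look like this, assuming whitespace-separated lines (groupChunk, chunk and chunkOut are illustrative names):

// Group one in-memory chunk of lines by id and write one "id<TAB>v1,v2,..." line per id.
// Values are deduplicated; the chunk is assumed small enough to fit in memory.
private void groupChunk(List<String> chunk, Path chunkOut) throws IOException {
    Map<String, Set<String>> grouped = new TreeMap<>();
    for (String line : chunk) {
        String[] parts = line.split("\\s+", 2);
        grouped.computeIfAbsent(parts[0], k -> new LinkedHashSet<>()).add(parts[1].trim());
    }
    List<String> out = new ArrayList<>(grouped.size());
    for (Map.Entry<String, Set<String>> e : grouped.entrySet()) {
        out.add(e.getKey() + "\t" + String.join(",", e.getValue()));
    }
    Files.write(chunkOut, out);
}
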
ymz
  • 6,602
  • 1
  • 20
  • 39
  • The problem is that the OP says that most IDs are single-value, so grouping won't decrease the number of rows by that much. – Sergei Tachenov Jan 21 '16 at 07:57
  • 800M rows with 5000 possible values... i'll take my chances :) – ymz Jan 21 '16 at 07:59
  • I didn't get it at first too. But think about it: 5000 values doesn't mean that they all have the same IDs. So there can be 600M IDs with a lot of repeated values. And since we're grouping by IDs, not by values, that doesn't help us much. – Sergei Tachenov Jan 21 '16 at 08:07
  • in that case, grouping would be very rare - therefore splitting the large file into smaller ones would help to investigate each value **separately** and observe whether it has multiple ids – ymz Jan 21 '16 at 08:14
  • Yes, I imagine grouping will be very rare. But why investigate whether the value has multiple IDs? We don't need that. We need to observe whether an ID has multiple values. – Sergei Tachenov Jan 21 '16 at 08:16
0

The file doesn't fit in memory, so I can't use simple Java collections. Also, most ids appear on a single line (like id2 in the example), and those lines should be written directly to the output file.

My solution would be to use a BufferedReader to read your large file (obviously the only way).

Store the key-value pairs in Redis (if you are on a Linux environment) or MongoDB (if you are on Windows).
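
As a rough illustration of the Redis route (this assumes the Jedis client; the host, port, key prefix and file path are just placeholders), you could stream the file line by line and push each value into a Redis set keyed by its id, which also deduplicates values for free:

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import redis.clients.jedis.Jedis;

public class RedisAggregator {
    public static void main(String[] args) throws Exception {
        Path fileIn = Paths.get("input.txt");                     // placeholder path
        try (Jedis jedis = new Jedis("localhost", 6379);
             BufferedReader reader = Files.newBufferedReader(fileIn)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\\s+", 2);
                jedis.sadd("agg:" + parts[0], parts[1].trim());   // one Redis set per id
            }
        }
        // A second pass over the Redis keys (e.g. with SCAN) can then write
        // "id<TAB>value1,value2,..." lines to the output file.
    }
}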

Jude Niroshan
  • 4,280
  • 8
  • 40
  • 62