
I need advice from someone who knows Java and its memory issues well. I have large CSV files (around 500 MB each) and I need to merge them into one file using only 64 MB of heap (-Xmx64m). I've tried to do it in different ways, but nothing works: I always get an out-of-memory error. What should I do to make it work properly?

The task is: Develop a simple implementation that joins two input tables in a reasonably efficient way and can store both tables in RAM if needed.

My code works, but it takes a lot of memory, so it can't fit in 64 MB.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ImprovedInnerJoin {
    public static void main(String[] args) throws IOException {

        RandomAccessFile firstFile  = new RandomAccessFile("input_A.csv", "r");
        FileChannel firstChannel    = firstFile.getChannel();
        RandomAccessFile secondFile = new RandomAccessFile("input_B.csv", "r");
        FileChannel secondChannel   = secondFile.getChannel();
        RandomAccessFile resultFile = new RandomAccessFile("result2.csv", "rw");
        FileChannel resultChannel   = resultFile.getChannel().position(0);

        ByteBuffer resultBuffer = ByteBuffer.allocate(40);
        ByteBuffer firstBuffer  = ByteBuffer.allocate(25);
        ByteBuffer secondBuffer = ByteBuffer.allocate(25);

        while (secondChannel.position() != secondChannel.size()) {
            // Build a hash map of table B, keyed by the join column (first CSV field).
            Map<String, List<String>> table2Part = new HashMap<>();
            for (int i = 0; i < secondChannel.size(); ++i) {
                if (secondChannel.read(secondBuffer) == -1)
                    break;
                secondBuffer.rewind();
                String[] table2Tuple = new String(secondBuffer.array(), Charset.defaultCharset()).split(",");
                if (!table2Part.containsKey(table2Tuple[0]))
                    table2Part.put(table2Tuple[0], new ArrayList<>());
                table2Part.get(table2Tuple[0]).add(table2Tuple[1]);
                secondBuffer.clear();
            }

            // Scan table A and emit one output row per matching key/value pair.
            Set<String> table2Keys = table2Part.keySet();
            while (firstChannel.read(firstBuffer) != -1) {
                firstBuffer.rewind();
                String[] table1Tuple = new String(firstBuffer.array(), Charset.defaultCharset()).split(",");
                for (String table2Key : table2Keys) {
                    if (table1Tuple[0].equals(table2Key)) {
                        for (String value : table2Part.get(table2Key)) {
                            String result = table1Tuple[0] + "," + table1Tuple[1].substring(0, 14) + "," + value; // 0,14 or the 40-byte result buffer would overflow
                            resultBuffer.put(result.getBytes());
                            resultBuffer.flip(); // flip (not rewind) so only the bytes just written are drained
                            while (resultBuffer.hasRemaining()) {
                                resultChannel.write(resultBuffer);
                            }
                            resultBuffer.clear();
                        }
                    }
                }
                firstBuffer.clear();
            }
            firstChannel.position(0);
            table2Part.clear();
        }

        firstChannel.close();
        secondChannel.close();
        resultChannel.close();
        System.out.println("Operation completed.");
    }
}
Alexey

3 Answers


A very easy-to-implement version of an external join is the external hash join. It is much simpler than an external merge-sort join and has only one drawback (more on that later).

How does it work?

Very similar to a hashtable. Choose a number n, which signifies how many files ("buckets") you're distributing your data into.

Then do the following:

  • Setup n file writers
  • For each of your files that you want to join and for each line:
    • take the hashcode of the key you want to join on
    • compute the modulo of the hashcode and n, that will give you k
    • append your csv line to the kth file writer
  • Flush/Close all n writers.

Now you have n, hopefully smaller, files with the guarantee that the same key will always be in the same file. Now you can run your standard HashMap/HashMultiSet based join on each of these files separately.
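Here is a minimal Java sketch of both passes. The file names, the bucket count n = 64, and the assumption that the join key is the first comma-separated column of every line are mine, not part of the question, so treat it as an illustration of the technique rather than a drop-in solution:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ExternalHashJoin {

    static final int N = 64; // number of buckets, tune so each bucket fits in the heap

    // Pass 1: distribute the lines of one input file over N bucket files by key hash.
    static void partition(Path input, String prefix) throws IOException {
        List<BufferedWriter> writers = new ArrayList<>();
        for (int k = 0; k < N; k++) {
            writers.add(Files.newBufferedWriter(Paths.get(prefix + "_" + k + ".csv"), StandardCharsets.UTF_8));
        }
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String key = line.substring(0, line.indexOf(','));   // join key = first column (assumption)
                int k = Math.floorMod(key.hashCode(), N);            // bucket index
                writers.get(k).write(line);
                writers.get(k).newLine();
            }
        }
        for (BufferedWriter w : writers) w.close();
    }

    // Pass 2: join bucket k of A with bucket k of B entirely in memory.
    static void joinBucket(Path bucketA, Path bucketB, BufferedWriter out) throws IOException {
        Map<String, List<String>> aRows = new HashMap<>();
        try (BufferedReader reader = Files.newBufferedReader(bucketA, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String key = line.substring(0, line.indexOf(','));
                aRows.computeIfAbsent(key, x -> new ArrayList<>()).add(line);
            }
        }
        try (BufferedReader reader = Files.newBufferedReader(bucketB, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String key = line.substring(0, line.indexOf(','));
                List<String> matches = aRows.get(key);
                if (matches == null) continue;
                for (String a : matches) {                           // emit one output row per match
                    out.write(a + "," + line.substring(line.indexOf(',') + 1));
                    out.newLine();
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        partition(Paths.get("input_A.csv"), "bucket_A");
        partition(Paths.get("input_B.csv"), "bucket_B");
        try (BufferedWriter out = Files.newBufferedWriter(Paths.get("result.csv"), StandardCharsets.UTF_8)) {
            for (int k = 0; k < N; k++) {
                joinBucket(Paths.get("bucket_A_" + k + ".csv"), Paths.get("bucket_B_" + k + ".csv"), out);
            }
        }
    }
}

The partitioning pass only ever holds a single line (plus the n writer buffers) in memory; only the per-bucket HashMap of the second pass has to fit in the heap, which is exactly what the bucket count controls.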

Limitations

Why did I mention hopefully smaller files? Well, it depends on the distribution of the keys and their hashcodes. Think of the worst case: every row in your files has the exact same key. Then everything ends up in a single bucket file and you haven't gained anything from the partitioning.

Similarly, for skewed distributions, sometimes a few of your bucket files will be too big to fit into your RAM. Usually there are three ways out of this dilemma:

  1. Run the algorithm again with a bigger n, so you have more buckets to distribute to
  2. Take only the buckets that are too big and do another hash partitioning pass only on those files (so each file goes into n newly created buckets again)
  3. Fall back to an external merge sort on the big partition files.

Sometimes all three are used in different combinations, which is called dynamic partitioning.

Thomas Jungblut

If main memory is a constraint for your application but you can access persistent storage, I would, as blahfunk suggested, create a temporary SQLite file in your tmp folder, read each file in chunks, and merge them with a simple SQL join. You could create the temporary SQLite DB with libraries such as Hibernate; take a look at what I found in this StackOverflow question: How to create database in Hibernate at runtime?
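As an illustration of that idea with plain JDBC instead of Hibernate (a sketch only: it assumes the xerial sqlite-jdbc driver on the classpath, two-column "key,value" CSV files, and file names of my choosing):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class SqliteJoin {
    public static void main(String[] args) throws Exception {
        // Temporary on-disk database, so the data never has to fit into the 64 MB heap.
        try (Connection con = DriverManager.getConnection("jdbc:sqlite:/tmp/join.db")) {
            try (Statement st = con.createStatement()) {
                st.execute("CREATE TABLE a (k TEXT, v TEXT)");
                st.execute("CREATE TABLE b (k TEXT, v TEXT)");
            }
            load(con, "input_A.csv", "INSERT INTO a VALUES (?, ?)");
            load(con, "input_B.csv", "INSERT INTO b VALUES (?, ?)");

            // Let SQLite do the join and stream the result back row by row.
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery("SELECT a.k, a.v, b.v FROM a JOIN b ON a.k = b.k");
                 BufferedWriter out = new BufferedWriter(new FileWriter("result.csv"))) {
                while (rs.next()) {
                    out.write(rs.getString(1) + "," + rs.getString(2) + "," + rs.getString(3));
                    out.newLine();
                }
            }
        }
    }

    // Stream one CSV file into the database, one row at a time.
    static void load(Connection con, String csv, String insertSql) throws Exception {
        con.setAutoCommit(false); // batch all inserts into one transaction
        try (BufferedReader in = new BufferedReader(new FileReader(csv));
             PreparedStatement ps = con.prepareStatement(insertSql)) {
            String line;
            while ((line = in.readLine()) != null) {
                int comma = line.indexOf(',');
                ps.setString(1, line.substring(0, comma));
                ps.setString(2, line.substring(comma + 1));
                ps.executeUpdate();
            }
        }
        con.commit();
        con.setAutoCommit(true);
    }
}

The heap only ever holds one CSV line and one result row at a time; the actual join happens inside SQLite on disk.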

If you cannot perform such a task, your remaining option is to trade CPU for memory: load just one row of the first file at a time, scan the second file for rows with the same key, buffer the matches, and flush them to the output file as late as possible, repeating this for every row of the first file.
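A rough sketch of that fallback (again with assumed file names and the key in the first column); it keeps only one line of each file in memory, at the price of re-reading the second file once for every row of the first:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class NestedLoopJoin {
    public static void main(String[] args) throws IOException {
        try (BufferedReader outer = new BufferedReader(new FileReader("input_A.csv"));
             BufferedWriter out = new BufferedWriter(new FileWriter("result.csv"))) {
            String aLine;
            while ((aLine = outer.readLine()) != null) {
                String key = aLine.substring(0, aLine.indexOf(','));
                // Re-scan the second file for every row of the first: cheap on memory, expensive on I/O.
                try (BufferedReader inner = new BufferedReader(new FileReader("input_B.csv"))) {
                    String bLine;
                    while ((bLine = inner.readLine()) != null) {
                        if (bLine.startsWith(key + ",")) {
                            out.write(aLine + "," + bLine.substring(bLine.indexOf(',') + 1));
                            out.newLine();
                        }
                    }
                }
            }
        }
    }
}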

    As an addendum, as said by Thomas Jungblut, performing a preliminary sort on the two csv files will surely speed up the process. See this: [file based merge sort on large datasets in Java](http://stackoverflow.com/questions/6314598/file-based-merge-sort-on-large-datasets-in-java) – Fergus Rataporn Feb 19 '16 at 20:18

Maybe you can stream the first file, turn each line into a hashcode, and keep all of those hashcodes in memory. Then stream the second file and compute a hashcode for each line as it comes in. If the hashcode is already there from the first file, i.e. it is in memory, don't write the line; otherwise write it. After that, append the first file in its entirety to the result file.

This effectively creates an index to compare your updates against.
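A small sketch of that idea (file names are assumptions; also note that two different lines can share a hashcode, so a collision would silently drop a line, and whether all the boxed hashcodes of a 500 MB file fit in 64 MB depends on how many lines it has):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class HashcodeMerge {
    public static void main(String[] args) throws IOException {
        // Pass 1: remember the hashcode of every line of the first file.
        Set<Integer> seen = new HashSet<>();
        try (BufferedReader in = new BufferedReader(new FileReader("input_A.csv"))) {
            String line;
            while ((line = in.readLine()) != null) {
                seen.add(line.hashCode());
            }
        }

        try (BufferedWriter out = new BufferedWriter(new FileWriter("result.csv"))) {
            // Pass 2: copy only the lines of the second file whose hashcode was not seen in the first.
            try (BufferedReader in = new BufferedReader(new FileReader("input_B.csv"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    if (!seen.contains(line.hashCode())) {
                        out.write(line);
                        out.newLine();
                    }
                }
            }
            // Pass 3: append the first file in its entirety.
            try (BufferedReader in = new BufferedReader(new FileReader("input_A.csv"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    out.write(line);
                    out.newLine();
                }
            }
        }
    }
}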

K.Nicholas