
I have a program that takes each item from a list and compares it against every item in another list. It's been working fine so far, but the data is getting large and is going to exceed system memory.

I'm wondering what the best way is to compare two lists that are very large (maybe 5-10 GB each)?

Here is a very simple example of what I'm doing (except the lists are huge and the values in the for loops are actually being processed/compared):

public class CompareLists {
    public static void main(String[] args) {
        String[] listOne = {"a", "b", "c", "d", "e", "f",
                            "g", "h", "i", "j", "k", "l"};

        String[] listTwo = {"one", "two", "three", "four",
                            "five", "six", "seven"};

        // Compare every item of listOne against every item of listTwo.
        for (int listOneItem = 0; listOneItem < listOne.length; listOneItem++) {
            for (int listTwoItem = 0; listTwoItem < listTwo.length; listTwoItem++) {
                System.out.println(listOne[listOneItem] + " " + listTwo[listTwoItem]);
            }
        }
    }
}

I realize there has to be some disk I/O here since the data won't fit in memory. My initial approach was to save both lists as files, load a bunch of lines from listOne, stream the entire file of listTwo, then get some more lines from listOne, and so on. Is there a better way? Or is there a Java way to access the lists like I'm doing above, but with swapping to disk as needed?

user1735075
  • http://programmers.stackexchange.com/ – Sam I am says Reinstate Monica Nov 12 '12 at 17:37
  • Are *both* lists too large to fit into memory? What comparison do you need to perform, and what will the result be? – Jon Skeet Nov 12 '12 at 17:38
  • @JonSkeet Yes, both are too big for memory. Each item is a bunch of data and is processed in the program. It's not a simple text comparison or anything. – user1735075 Nov 12 '12 at 18:06
  • Attempting to perform an O(N^2) comparison on a data set which doesn't fit in memory is likely to take far too long, e.g. months or years. I suggest you rethink what assumptions you can make to simplify the problem, or buy more memory (you can buy 16 GB for less than $100). You can parse a 10 GB list in about 4 minutes, but if you have to do this 100 million times (10 GB / 100 B) it will take about 800 years. – Peter Lawrey Nov 12 '12 at 18:47
  • @PeterLawrey I didn't realize that. I guess that complicates things a bit. What if I upgraded the memory on the servers and it was enough to store one list in memory? Would the solution be different? – user1735075 Nov 12 '12 at 18:54
  • @user1735075 Even if you can assume that a good portion of the data can be kept in memory, the solution improves dramatically. The more memory you have the better; how much memory do you have? – Peter Lawrey Nov 12 '12 at 19:17
  • Say you have enough memory; the next problem is that brute-force O(n^2) is likely to still be too slow. Say each file has 100 M lines and processing each combination takes about 100 nanoseconds (which is not much): this means the whole process will take about 30 years. You need to make assumptions like sorting the data and doing a merge sort, which is O(N ln N) and more likely to take hours. – Peter Lawrey Nov 12 '12 at 19:24
  • @PeterLawrey Thanks Peter, I'll do some tests, but I don't quite understand. If I'm comparing one list against another, how will sorting improve the performance so much when every element will need to be compared to another? – user1735075 Nov 12 '12 at 20:16
  • That depends on the purpose of the comparison. By sorting the lists first, you can compare only similar lines, cutting the work dramatically. You have to find some way of cutting down the problem, because O(n^2) will take far too long, as I have explained. – Peter Lawrey Nov 12 '12 at 20:22
  • @PeterLawrey I understand now, makes sense. Last question: can I still sort the data if it's too large for memory? – user1735075 Nov 12 '12 at 20:54
  • Yes, it is possible to sort files that are too large to fit into memory. The *nix command 'sort' is an example of a program that can do this. See http://stackoverflow.com/questions/930044/how-could-the-unix-sort-command-sort-a-very-large-file and http://stackoverflow.com/questions/7918060/how-do-i-sort-very-large-files. Basically you break your big list into smaller sublists that fit in memory. Sort each sublist in memory, then save the sublist to a temporary file. You end up with a lot of sublists stored in temp files, then you use merge-sort to combine the sublists into one big sorted list. – Mike Clark Nov 12 '12 at 22:13
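To make the external-sort idea from the last comment concrete, here is a minimal sketch of a two-phase external merge sort, assuming one item per line in a text file; the file names and the CHUNK_LINES value are placeholders, not anything from the question:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class ExternalSort {
    static final int CHUNK_LINES = 1_000_000; // placeholder: whatever fits comfortably in memory

    // A sorted chunk file together with its current (smallest unread) line.
    static class Head {
        String line;
        final BufferedReader reader;
        Head(String line, BufferedReader reader) { this.line = line; this.reader = reader; }
    }

    public static void main(String[] args) throws IOException {
        // Phase 1: read the big file in chunks, sort each chunk in memory, spill it to a temp file.
        List<Path> chunkFiles = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("bigList.txt"))) {
            List<String> chunk = new ArrayList<>();
            String line;
            while ((line = reader.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == CHUNK_LINES) {
                    chunkFiles.add(writeSortedChunk(chunk));
                    chunk = new ArrayList<>();
                }
            }
            if (!chunk.isEmpty()) {
                chunkFiles.add(writeSortedChunk(chunk));
            }
        }

        // Phase 2: k-way merge of the sorted chunks into one big sorted file.
        PriorityQueue<Head> heads = new PriorityQueue<>((a, b) -> a.line.compareTo(b.line));
        for (Path chunkFile : chunkFiles) {
            BufferedReader r = Files.newBufferedReader(chunkFile);
            String first = r.readLine();
            if (first != null) heads.add(new Head(first, r));
        }
        try (BufferedWriter out = Files.newBufferedWriter(Paths.get("bigList.sorted.txt"))) {
            while (!heads.isEmpty()) {
                Head smallest = heads.poll();
                out.write(smallest.line);
                out.newLine();
                String next = smallest.reader.readLine();
                if (next != null) {
                    smallest.line = next;
                    heads.add(smallest);
                } else {
                    smallest.reader.close();
                }
            }
        }
    }

    static Path writeSortedChunk(List<String> chunk) throws IOException {
        Collections.sort(chunk);
        Path tempFile = Files.createTempFile("chunk", ".txt");
        Files.write(tempFile, chunk);
        return tempFile;
    }
}

Once both lists have been sorted this way, whether a single merge-style pass can replace the full pairwise comparison depends entirely on what the comparison actually is, as the comments above note.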

3 Answers


You could put the Big Data in flat files and then stream in one item of data at a time from the files. This way only two items are in memory at any given time.

Obviously this isn't going to win any efficiency awards, but here's a simple example that uses text files containing one item per line:

// Needs java.io.BufferedReader, java.io.FileReader and IOException handling.
try (BufferedReader readerA = new BufferedReader(new FileReader("listA.txt")))
{
    String lineA;
    while ((lineA = readerA.readLine()) != null)
    {
        // Re-open and re-stream the whole of listB for every line of listA.
        try (BufferedReader readerB = new BufferedReader(new FileReader("listB.txt")))
        {
            String lineB;
            while ((lineB = readerB.readLine()) != null)
            {
                compare(lineA, lineB); // placeholder for your comparison logic
            }
        } // readerB is closed automatically here
    }
} // readerA is closed automatically here

If the data you're working with is too complicated to easily store one item per line in a text file, you could do a similar thing with ObjectInputStream and ObjectOutputStream, which can read and write one Java object at a time to a file.
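For illustration, here is a minimal sketch of that object-streaming approach; the Item class, the file name and the null end-of-stream marker are assumptions for the example, not anything from the question:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class ObjectStreamExample {
    // Hypothetical serializable class standing in for one item of the real data.
    static class Item implements Serializable {
        private static final long serialVersionUID = 1L;
        final String payload;
        Item(String payload) { this.payload = payload; }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException {
        // Write items one object at a time.
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("listB.bin"))) {
            out.writeObject(new Item("one"));
            out.writeObject(new Item("two"));
            out.writeObject(null); // end-of-stream marker
        }

        // Stream them back one at a time, never holding the whole list in memory.
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream("listB.bin"))) {
            Item item;
            while ((item = (Item) in.readObject()) != null) {
                System.out.println(item.payload);
            }
        }
    }
}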

If you can manage to fit listB in memory, you'd obviously save quite a bit of disk access, since you would no longer re-read listB from disk on every pass of the outer loop. Memoization might help you fit listB into memory if the data contains enough duplicates.

Also, the comparison of items is a textbook example of a problem that can be sped up with parallelization: e.g. hand the comparison work off to worker threads so that the file-reading thread can focus on maximizing throughput from the disk.
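As a rough sketch of that split (the pool size, the file names and the compareAgainstListB method are illustrative assumptions):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelCompare {
    public static void main(String[] args) throws IOException, InterruptedException {
        ExecutorService workers = Executors.newFixedThreadPool(4); // illustrative pool size

        // The main thread does nothing but read; workers do the expensive comparisons.
        try (BufferedReader readerA = new BufferedReader(new FileReader("listA.txt"))) {
            String lineA;
            while ((lineA = readerA.readLine()) != null) {
                final String a = lineA;
                workers.submit(() -> compareAgainstListB(a));
            }
        }

        workers.shutdown();
        workers.awaitTermination(1, TimeUnit.DAYS);
    }

    // Placeholder: stream listB.txt and compare each of its lines against 'a'.
    static void compareAgainstListB(String a) {
        try (BufferedReader readerB = new BufferedReader(new FileReader("listB.txt"))) {
            String lineB;
            while ((lineB = readerB.readLine()) != null) {
                // compare(a, lineB);  // the real comparison goes here
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}

In practice you would also want to bound the work queue (for example with a fixed-capacity BlockingQueue) so the reading thread can't race arbitrarily far ahead of the workers.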

Mike Clark
  • It's worth noting this will take approximately 800 years. ;) – Peter Lawrey Nov 12 '12 at 20:22
  • @PeterLawrey Your comments on execution time were insightful and I upvoted them. He didn't specify execution time constraints, though, and I had no idea how far into the problem he had gotten. So I thought to start with the most basic code to which problem-specific optimizations could hopefully be applied. For example your pre-sorting suggestion would be interesting if the data lends itself to that type of optimization. – Mike Clark Nov 12 '12 at 21:27
  • +1 your answer btw, as it's the best you can do given the information in the question. – Peter Lawrey Nov 13 '12 at 09:37

Use the Flyweight pattern. Here is a link:

http://en.wikipedia.org/wiki/Flyweight_pattern
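Presumably the suggestion is to share a single instance per distinct value, so duplicated data costs memory only once. A minimal flyweight-style factory might look like this (ItemData is a hypothetical stand-in for one item of the real data):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical immutable value object standing in for one item of the list.
final class ItemData {
    final String value;
    private ItemData(String value) { this.value = value; }

    // Flyweight factory: identical values map to one shared instance.
    private static final Map<String, ItemData> POOL = new ConcurrentHashMap<>();

    static ItemData of(String value) {
        return POOL.computeIfAbsent(value, ItemData::new);
    }
}

class FlyweightDemo {
    public static void main(String[] args) {
        ItemData a = ItemData.of("duplicate");
        ItemData b = ItemData.of("duplicate");
        System.out.println(a == b); // true: one shared instance
    }
}

Note that this only helps if the lists contain many repeated values (String.intern() is the same idea for plain strings); it does not by itself solve the problem of lists larger than memory.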

Yves_T

I can see that you're aiming to perform something on the Cartesian product of two very large lists.

And I assume that the inefficiency you're worried about is the time spent reading the lists from file into main memory.

How about dividing each list into blocks that you can load into memory? Say l1[0] is the block of the first 1000 items in l1 and l1[1] is the block of the next 1000 items.

Then you want to compare:

l1[0] with l2[0]
l1[0] with l2[1]
l1[0] with l2[2]
...
l1[1] with l2[0]
l1[1] with l2[1]
l1[1] with l2[2]
...

to achieve the same total effect with much less reading from file.
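A rough sketch of that blocking scheme, assuming one item per line in text files; the file names and the block size of 1000 are placeholders:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class BlockCompare {
    static final int BLOCK_SIZE = 1000; // placeholder: tune to what fits comfortably in memory

    public static void main(String[] args) throws IOException {
        try (BufferedReader readerA = new BufferedReader(new FileReader("listA.txt"))) {
            List<String> blockA;
            while (!(blockA = readBlock(readerA)).isEmpty()) {
                // Re-stream listB once per block of listA instead of once per item.
                try (BufferedReader readerB = new BufferedReader(new FileReader("listB.txt"))) {
                    List<String> blockB;
                    while (!(blockB = readBlock(readerB)).isEmpty()) {
                        for (String a : blockA) {
                            for (String b : blockB) {
                                System.out.println(a + " " + b); // stand-in for the real comparison
                            }
                        }
                    }
                }
            }
        }
    }

    // Read up to BLOCK_SIZE lines; an empty list signals end of file.
    static List<String> readBlock(BufferedReader reader) throws IOException {
        List<String> block = new ArrayList<>(BLOCK_SIZE);
        String line;
        while (block.size() < BLOCK_SIZE && (line = reader.readLine()) != null) {
            block.add(line);
        }
        return block;
    }
}

Compared with re-reading listB for every single item of listA, this reads listB roughly (size of listA) / BLOCK_SIZE times, while the number of comparisons stays the same.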

Apiwat Chantawibul