0

I have two huge files and each file will store integer values each line. The format looks like below:

File1:

3
4
11
30
0
...

File2:

13
43
11
40
9
...

I need to find a way to read these two files and find the values that appear in both of them. In the example above, the value 11 would be printed since it appears in both files. Reading from a file and looping over the values is easy. But the problem is that the number of lines in each file is far more than Integer.MAXIMUM, so I can't read a whole file into memory or I will run out of memory. Is there an efficient way to solve this problem while consuming less memory?

EDIT1

I want to find a solution that does not read all the data into memory. It would be better to read a part of the file, analyze it, and then continue reading. But I don't know how to achieve this.

Joey Yi Zhao
  • 37,514
  • 71
  • 268
  • 523
  • `But the problem is that the number of lines for each file is far more than Integer.MAXIMUM` ... then why not use a collection with `Long`? – Tim Biegeleisen Oct 31 '16 at 12:06
  • It is also a problem. Even use Long, still I will run out of memory. I want to find a solution to solve the problem without reading all the data to memory. – Joey Yi Zhao Oct 31 '16 at 12:08
  • Is there any data limitation ? Eg numbers in file are in range [0,N] ? – Antoniossss Oct 31 '16 at 12:09
  • No there is no data limitation. It can be any value. – Joey Yi Zhao Oct 31 '16 at 12:10
  • Then you will have to split files into ordered subsets of fixed data range. – Antoniossss Oct 31 '16 at 12:11
  • @Antoniossss Brilliant idea – Tim Biegeleisen Oct 31 '16 at 12:12
  • if you didn't care about speed, you could read the first line from file A, then read one line at a time in file B looking for dupes. Then go to the second line in file A and cycle through file B again. That would only have 1 line from each file in memory, plus remembering the dupes. This is highly inefficient though. So now you know the two extremes; you want something in the middle. Hint: sorting them first might help. – Brian Pipa Oct 31 '16 at 12:12
  • @BrianPipa this will take Integer.maxValue^2 time – Antoniossss Oct 31 '16 at 12:13
  • I will give you a hint... you can use the stream API instead of reading the file in full if you want to do it in Java. In unix simply `comm -12 <(sort file1) <(sort file2)`. – Palcente Oct 31 '16 at 12:13
  • @Antoniossss I know it will take a long time - that was my point.Trying to get him to think differently about the problem. – Brian Pipa Oct 31 '16 at 12:14
  • @BrianPipa Thanks for your solution but I really don't want to do time^2. – Joey Yi Zhao Oct 31 '16 at 12:14
  • @Palcente How to keep the value when using stream API? I need to compare the values between these two files. If I use stream API, the previous values may be gone right? – Joey Yi Zhao Oct 31 '16 at 12:16
  • @ZhaoYi the best you could do would be `O(n log n)` when sorting both files. There's no faster approach than sorting and linear comparison –  Oct 31 '16 at 12:17
  • What is the range of integer values in the files? If not too large, you could use an array to keep a count of the number of times each value appears in each file and use that to find the duplicates. – pcarter Oct 31 '16 at 12:21
  • @BrianPipa but this one is crazy long (lifetime of the universe ?? - kidding ;P) – Antoniossss Oct 31 '16 at 12:32
  • can you have duplicates within each files? I mean could we have `11` twice in File 1 and 3 times in File 2 ? – Nicolas Filotto Oct 31 '16 at 12:47
  • @NicolasFilotto it seems that it is inevitable that there will be duplicates within each file, since each file contains more than Integer.MAX_VALUE values, and you can't have more than Integer.MAX_VALUE distinct values (unless, of course, the file also contains negative values). – Klitos Kyriacou Oct 31 '16 at 13:48
  • @KlitosKyriacou I believe you are right but it assumes that the values are all < `Integer.MAX_VALUE` which is not clear from the question – Nicolas Filotto Oct 31 '16 at 14:33
  • It's not clear from the question, which says "each file will store integer values"... it depends on whether the word "integer" is used here in the mathematical sense or the Java sense. – Klitos Kyriacou Oct 31 '16 at 16:07

4 Answers

1

A naive approach to minimize memory would be to read both files line by line ("for each line of file 1, for each line of file 2, compare the lines"), but that might take quite long, as the bottleneck is disk I/O. It sits at one end of the spectrum: minimum memory, maximum duration. The other end would be to create huge heaps and operate entirely in memory: maximum memory, minimum duration. So the best solution is something in between, balancing the tradeoffs, which usually requires some more upfront thinking.

analyze the input data

With two files containing only integer (32-bit) values, and both files containing more than 2^32 entries, you will certainly have duplicates already within the first file.

But to determine whether file 2 contains a duplicate, we only need to know whether the same value occurs at least once in the first file. That's just 1 bit of information we have to keep in memory per value, not the entire value. Because the value range is limited to int, and for each possible value we have to know whether it occurs at least once, we need a bit set with 2^32 bits.

In a single long value we can store this information for 64 values of your file (64 bits). So we need a long array of size 67,108,864 to store the entire occurrence information of file 1 in memory. That's around 512 MB of memory.

After you have read this representation of file 1 into memory, you can scan file 2 line by line and check for each value whether it occurs at least once in file 1, using the array we've created, and print out whether it's a duplicate (or write it to another file or into a data structure).

the smart version

In case you want to get your hands dirty, continue reading. If you want to use what's out of the JDK box, use a BitSet of size Integer.MAX_VALUE (thanks @Mzf).

    BitSet bs = new BitSet(Integer.MAX_VALUE);
    bs.set(valueFromFile1); 
    boolean duplicate = bs.get(valueFromFile2); 
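To make that fragment concrete, here is a minimal self-contained sketch of the BitSet approach. The class and method names are mine, and it assumes all values are non-negative ints; feed it iterators obtained by streaming each file line by line (e.g. `Files.lines(path).map(Integer::parseInt).iterator()`), so only one value per file is held in memory besides the BitSet:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Iterator;
import java.util.List;

public class BitSetDuplicates {
    /**
     * Marks every value from the first sequence in a BitSet, then reports
     * values from the second sequence that were marked. Memory use is
     * dominated by the BitSet (up to ~256 MB for the full int range),
     * regardless of how long the sequences are.
     */
    static List<Integer> duplicates(Iterator<Integer> file1, Iterator<Integer> file2) {
        BitSet seen = new BitSet();              // grows on demand; assumes non-negative values
        while (file1.hasNext()) {
            seen.set(file1.next());              // 1 bit per value seen in file 1
        }
        List<Integer> result = new ArrayList<>();
        while (file2.hasNext()) {
            int value = file2.next();
            if (seen.get(value)) {               // occurs in both files
                result.add(value);
            }
        }
        return result;
    }
}
```

With the sample data from the question, `duplicates` would report only 11.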

the version for men with beards who run as root:

The structure of the lookup array is like

[ 0000 0001 0000 0000 ... , 0000 0000 1000 0000 ..., ...]
          ^                           ^
          7 (0*64 + 7)                72 (1*64 + 8)

What you need to have is a conversion from int value to index position and bit-offset.

int pos(int value) {
    return value / 64;
}

long offsetMask(int value) {
    return 1L << (value % 64);
}

boolean exists(long[] index, int value) {
    return (index[pos(value)] & offsetMask(value)) != 0;
}


long[] index = new long[67108864];

//read reference file
Scanner sc1 = new Scanner(new File("file1"));
while (sc1.hasNextInt()) {
    int value = sc1.nextInt();
    index[pos(value)] |= offsetMask(value);
}
sc1.close();

//find duplicates
Scanner sc2 = new Scanner(new File("file2"));
while (sc2.hasNextInt()) {
    int value = sc2.nextInt();
    if (exists(index, value)) {
        System.out.println("Duplicate: " + value);
    }
}
sc2.close();

(it's basically the same as what the BitSet does internally)

It doesn't matter if the files get larger; as long as the value range does not increase, you do not need more than 512 MB.

Gerald Mücke
  • 10,724
  • 2
  • 50
  • 67
  • 2
    since this java question - you can use the BitSet. https://docs.oracle.com/javase/7/docs/api/java/util/BitSet.html – Mzf Oct 31 '16 at 14:07
  • yes, it does basically the same, and is officially supported :) thanks for the pointer – Gerald Mücke Oct 31 '16 at 14:13
0

First sort each file from smallest to largest. Now read the first line from each file and compare. If they match, record a duplicate, then advance to the next line in either file (I'd choose file A). If they don't match, advance to the next line in the file that had the smaller value. Repeat until you reach the end of one file.
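The merge step described above can be sketched as follows. This is my own illustration, not the answerer's code; it works on iterators over two ascending sequences (e.g. line-by-line readers over the pre-sorted files), so only one value per file is in memory at a time:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class SortedMerge {
    /**
     * Walks two ascending sequences in lock-step and collects values
     * present in both. Runs in linear time over the combined input.
     */
    static List<Integer> commonValues(Iterator<Integer> a, Iterator<Integer> b) {
        List<Integer> result = new ArrayList<>();
        if (!a.hasNext() || !b.hasNext()) return result;
        int x = a.next(), y = b.next();
        while (true) {
            if (x == y) {
                result.add(x);                        // found a duplicate
                if (!a.hasNext() || !b.hasNext()) break;
                x = a.next();                         // advance both sides
                y = b.next();
            } else if (x < y) {
                if (!a.hasNext()) break;
                x = a.next();                         // smaller side catches up
            } else {
                if (!b.hasNext()) break;
                y = b.next();
            }
        }
        return result;
    }
}
```

If a value can repeat within one file, you would additionally skip over equal consecutive values to avoid reporting the same duplicate more than once.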

Brian Pipa
  • 808
  • 9
  • 23
0

There are several ways to solve this problem. But to start with, you don't have to read the entire file into memory; you can read it line by line, for example using FileUtils - see several examples here.

You said that the number of lines is bigger than Integer.MAXIMUM. So you can read the first file line by line and save the distinct numbers in a HashSet. Now read the second file line by line and, for each line, look the number up in the set - if it exists, print it. The amount of memory used is bounded by the number of distinct values, at most Integer.MAXIMUM.
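A sketch of that set-based variant might look like this (class and method names are my own; it streams both files line by line, so only the distinct values of the first file are kept in memory):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Stream;

public class SetDuplicates {
    /**
     * Streams both files line by line. Memory use is proportional to the
     * number of distinct values in the first file, not its line count.
     */
    static Set<Integer> duplicates(Path file1, Path file2) throws IOException {
        Set<Integer> seen = new HashSet<>();
        try (Stream<String> lines = Files.lines(file1)) {
            lines.forEach(line -> seen.add(Integer.parseInt(line.trim())));
        }
        Set<Integer> result = new HashSet<>();
        try (Stream<String> lines = Files.lines(file2)) {
            lines.map(String::trim)
                 .map(Integer::parseInt)
                 .filter(seen::contains)   // value also occurred in file 1
                 .forEach(result::add);
        }
        return result;
    }
}
```

Note that this only pays off when the number of distinct values is much smaller than the line count; otherwise the bitmap below is the safer bound.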

If you want to reduce the memory footprint, you can use a bitmap instead of a set, so instead of using sizeof(int) * #distinct_numbers_in_first_file you will use a constant size of Integer.MAXIMUM bits.

If you know something about the input, you can choose the right solution for you.

In both cases, the memory used is limited and you don't load the entire file into memory.

Mzf
  • 5,210
  • 2
  • 24
  • 37
  • number of distinct values is larger than 'Integer.MAXIMUM' – Palcente Oct 31 '16 at 12:28
  • @Palcente - this is not what written in the question `But the problem is that the number of lines for each file is far more than Integer.MAXIMUM` not the number of distinct values - or I miss something ? – Mzf Oct 31 '16 at 12:30
  • @Palcente how can the number of distinct values _possibly_ be larger than Integer.MAX_VALUE? There are only Integer.MAX_VALUE integers in existence! (Except negative ones...) – Klitos Kyriacou Oct 31 '16 at 14:04
  • @KlitosKyriaco this is some serious 32 bit heresy I'm reading here – Palcente Oct 31 '16 at 14:25
0

First you will have to split your input into smaller files with ordered data.

These files should have a fixed data range, so the first file would have the distinct values in the range 0-1000000, the second 1000001-2000000, and so on.

If you do this for each input, you will end up with ordered "buckets" of values in a given range. All you have to do now is compare those buckets against each other to get the duplicate values.

This consumes disk space in exchange for lower memory usage.
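A rough sketch of the partitioning step might look like this. The class name, file-naming scheme, and the bucket width of 1,000,000 are all my own choices, and it assumes non-negative values; each input file would be run through `partition`, after which buckets with the same index are compared pairwise:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

public class RangePartitioner {
    static final int BUCKET_WIDTH = 1_000_000;

    /**
     * Splits one input file into bucket files "prefix-N" so that bucket N
     * holds only values in [N * BUCKET_WIDTH, (N + 1) * BUCKET_WIDTH).
     * The input is streamed line by line; only the bucket writers are kept open.
     */
    static void partition(Path input, Path dir, String prefix) throws IOException {
        Map<Integer, BufferedWriter> writers = new HashMap<>();
        try (Stream<String> lines = Files.lines(input)) {
            for (String line : (Iterable<String>) lines::iterator) {
                int value = Integer.parseInt(line.trim());
                int bucket = value / BUCKET_WIDTH;
                BufferedWriter w = writers.computeIfAbsent(bucket, b -> {
                    try {
                        return Files.newBufferedWriter(dir.resolve(prefix + "-" + b));
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
                w.write(Integer.toString(value));
                w.newLine();
            }
        }
        for (BufferedWriter w : writers.values()) {
            w.close();
        }
    }
}
```

Each pair of matching buckets is then small enough to sort (or load into a set) in memory and compare for duplicates.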

This is how I would try to solve this at first glance.

Antoniossss
  • 31,590
  • 6
  • 57
  • 99