A naive approach to minimize memory would be to read every file line by line ("for each line of file 1 do for each line of file 2 compare line"), but might take quite long as the bottle neck is disk I/O. It's on the lower end of the spectrum - minimum memory, maximum duration.
On the other end would be to create huge heaps and operate entirely on memory that - maximum memory, minimum duration.
So the best solution would be something between, balancing tradeoffs, which usually requires some more upfront thinking.
analyze the input data
With two files only containing integers (32bit) values, and both files containing >2^32 entries you certainly will have duplicates already in the first file.
But to determine, if file 2 contains a duplicate, we only need to know if the same value occurs at least once in the first file. That just 1 bit information we have to keep in memory, not the entire value.
Because the value-range is limited to integer, and for each possible value we have to know if it occurs at least once, we need a bit-set with 2^32 bits.
In a single long value we can store the information for 64 values of your file 64 bits). So we need a long array of size 67'108'864 to store the entire min-occurence information of file 1 in memory. That's around 512 MB of memory you need.
After you have read this representation of file 1 into memory. You could scan file 2 line by line and check for each value, if it occurs at least once in file 1 using the array we've created and print out if its a duplicate or not (or write it to another file or into a datastrcture).
the smart version
In case you want to get your hands dirty, continue reading. If you want to use what's out of the JDK box, use a BitSet of size Integer.MAX_VALUE
(thanks @Mzf).
BitSet bs = new BitSet(Integer.MAX_VALUE);
bs.set(valueFromFile1);
boolean duplicate = bs.get(valueFromFile2);
the version for men with beards who run as root:
The structure of the lookup array is like
[ 0000 0001 0000 0000 ... , 0000 0000 1000 000 ..., ...]
^ ^
7 (0*64 + 7) 74 (1*64 + 8)
What you need to have is a conversion from int value to index position and bit-offset.
int pos(int value){
return value / 64;
}
long offsetMask(int value){
return 1L << (value % 64)
}
boolean exists(long[] index, int value) {
return (index[pos(value)] & offsetMask(value)) != 0;
}
long[] index = new long[67108864];
//read references file
Scanner sc = new Scanner(new File("file1"));
while (sc.hasNextInt()) {
int value = sc.nextInt();
index[pos(value)] |= offsetMask(value);
}
//find duplicates
Scanner sc = new Scanner(new File("file2"));
while (sc.hasNextInt()) {
int value = sc.nextInt();
boolean result = exists(index, value)
if(result) {
System.out.println("Duplicate: " + value);
}
}
(it's basically the same what's done in the BitSet)
It doesn't matter if files are larger as longe as the value range does not increase you do not need more than 512 MB.