I have a general question and would like your opinion on my "technique".
There are two text files (file_1 and file_2) that need to be compared to each other. Both are huge (3-4 gigabytes, 30,000,000 to 45,000,000 lines each).
My idea is to read several lines (as many as possible) of file_1 into memory, then compare those to all lines of file_2. If there's a match, the matching lines from both files are written to a new file. Then I go on with the next 1000 lines of file_1 and compare those to all lines of file_2 as well, until I have gone through file_1 completely.
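The chunked approach described above could be sketched roughly like this. It is only an illustration, not a recommendation: the class name ChunkedCompare, the tab-separated output format, the UTF-8 encoding and the pluggable matches predicate are all assumptions made for the example. Note that every chunk triggers one full pass over file_2, so with 30,000,000 lines and chunks of 1000 lines that is 30,000 passes.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

// Hypothetical helper: name and output format are assumptions for this sketch.
class ChunkedCompare {

    /**
     * Reads file1 in chunks of at most chunkSize lines (chunkSize >= 1) and,
     * for each chunk, streams once through file2. Every matching pair is
     * written to out as "line1<TAB>line2". Returns the number of pairs found.
     */
    static long compare(Path file1, Path file2, Path out, int chunkSize,
                        BiPredicate<String, String> matches) throws IOException {
        long found = 0;
        try (BufferedReader r1 = Files.newBufferedReader(file1, StandardCharsets.UTF_8);
             BufferedWriter w = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            List<String> chunk = new ArrayList<>(chunkSize);
            String line;
            while ((line = r1.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == chunkSize) {
                    found += scan(chunk, file2, w, matches);
                    chunk.clear();
                }
            }
            // Handle the last, possibly shorter, chunk.
            if (!chunk.isEmpty()) {
                found += scan(chunk, file2, w, matches);
            }
        }
        return found;
    }

    /** One full pass over file2, testing each of its lines against the chunk. */
    private static long scan(List<String> chunk, Path file2, BufferedWriter w,
                             BiPredicate<String, String> matches) throws IOException {
        long found = 0;
        try (BufferedReader r2 = Files.newBufferedReader(file2, StandardCharsets.UTF_8)) {
            String line2;
            while ((line2 = r2.readLine()) != null) {
                for (String line1 : chunk) {
                    if (matches.test(line1, line2)) {
                        w.write(line1);
                        w.write('\t');
                        w.write(line2);
                        w.newLine();
                        found++;
                    }
                }
            }
        }
        return found;
    }
}
```

It could be called as ChunkedCompare.compare(file1, file2, out, 1000, String::equals) for plain equality, or with any other two-argument test in place of String::equals.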
But this sounds actually really, really time consuming and complicated to me. Can you think of any other method to compare those two files?
How long do you think the comparison could take? For my program, time does not matter that much. I have no experience in working with such huge files, therefore I have no idea how long this might take. It shouldn't take more than a day though. ;-) But I am afraid my technique could take forever...
Another question that just came to my mind: how many lines would you read into memory? As many as possible? Is there a way to determine the number of lines that will fit before actually trying it? I want to read as many as possible (because I think that's faster), but I've run out of memory quite often.
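For the question of how many lines fit, a rough estimate can be derived from what the JVM reports about its heap. The sketch below uses the standard Runtime methods maxMemory(), totalMemory() and freeMemory(); the class name MemoryBudget, the bytesPerLine figure and the safety fraction are assumptions for the example. The real per-line cost depends on line length and on Java's per-object overhead, so it would have to be measured rather than guessed.

```java
// Hypothetical helper: the name and the estimation formula are assumptions.
class MemoryBudget {

    /** Heap bytes the JVM could still use: max heap minus what is currently occupied. */
    static long availableHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return rt.maxMemory() - used;
    }

    /**
     * Rough number of lines that fit into availableBytes, using only the given
     * fraction of it (0 < fraction <= 1) and an assumed cost per line.
     */
    static long estimateLines(long availableBytes, long bytesPerLine, double fraction) {
        if (bytesPerLine <= 0 || fraction <= 0 || fraction > 1) {
            throw new IllegalArgumentException("bytesPerLine must be > 0 and fraction in (0, 1]");
        }
        return (long) (availableBytes * fraction) / bytesPerLine;
    }
}
```

For example, MemoryBudget.estimateLines(MemoryBudget.availableHeapBytes(), 200, 0.5) would budget half of the remaining heap at an assumed 200 bytes per line. The maximum heap itself can be raised with the JVM's -Xmx option.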
Thanks in advance.
EDIT I think I have to explain my problem a bit more.
The purpose is not to see if the two files in general are identical (they are not).
There are some lines in each file that share the same "characteristic".
Here's an example:
file_1 looks somewhat like this:
mat1 1000 2000 TEXT //this means the range is from 1000 - 2000
mat1 2040 2050 TEXT
mat3 10000 10010 TEXT
mat2 20 500 TEXT
file_2 looks like this:
mat3 10009 TEXT
mat3 200 TEXT
mat1 999 TEXT
TEXT refers to characters and digits that are of no interest to me. mat can range from mat1 to mat50, and the lines are in no particular order; there can also be 1000 lines with mat2 (but the numbers in the next column are different). I need to find the matching lines such that matX is the same in both compared lines and the number mentioned in file_2 falls within the range mentioned in file_1.
So in my example I would find one match: line 3 of file_1 and line 1 of file_2 (because both are mat3 and 10009 is between 10000 and 10010).
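Expressed in Java, the matching rule from the example might look like the sketch below. The class name RangeMatch is made up for illustration, and two things are assumptions: that columns are separated by whitespace, and that the range bounds are inclusive (the example only shows a number strictly inside the range).

```java
// Hypothetical helper: the name and the inclusive-bounds choice are assumptions.
class RangeMatch {

    /**
     * line1 has the file_1 layout "matX start end TEXT",
     * line2 has the file_2 layout "matX number TEXT".
     * Returns true if matX is equal and start <= number <= end.
     */
    static boolean matches(String line1, String line2) {
        String[] a = line1.trim().split("\\s+", 4); // mat, start, end, rest
        String[] b = line2.trim().split("\\s+", 3); // mat, number, rest
        if (a.length < 3 || b.length < 2) {
            return false; // malformed line, treated as no match
        }
        if (!a[0].equals(b[0])) {
            return false; // different matX
        }
        try {
            long start = Long.parseLong(a[1]);
            long end = Long.parseLong(a[2]);
            long number = Long.parseLong(b[1]);
            return start <= number && number <= end;
        } catch (NumberFormatException e) {
            return false; // non-numeric column, treated as no match
        }
    }
}
```

With the sample lines, RangeMatch.matches("mat3 10000 10010 TEXT", "mat3 10009 TEXT") is true, while "mat1 1000 2000 TEXT" against "mat1 999 TEXT" is false because 999 lies outside the range.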
I hope this makes it clear to you!
So my question is: how would you search for the matching lines?
Yes, I use Java as my programming language.
EDIT I have now divided the huge files first, so that I no longer run out of memory. I also think it is faster to compare (many) smaller files to each other than those two huge files. After that I can compare them the way I mentioned above. It may not be the perfect way, but I am still learning ;-) Nonetheless, all your approaches were very helpful to me, thank you for your replies!
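The edit does not say how the files were divided. One plausible way, shown here purely as an assumption, is to split each file by its first column (mat1 to mat50), so that only files with the same mat ever need to be compared. The class name SplitByKey and the "<key>.txt" file naming are made up for the sketch, and it assumes the first column is safe to use as a file name.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: splitting by the first column is an assumption.
class SplitByKey {

    /**
     * Streams input once and appends each non-blank line to outDir/<key>.txt,
     * where key is the first whitespace-separated column (e.g. "mat3").
     * Returns a map from key to the file written for it.
     */
    static Map<String, Path> split(Path input, Path outDir) throws IOException {
        Files.createDirectories(outDir);
        Map<String, BufferedWriter> writers = new HashMap<>();
        Map<String, Path> files = new TreeMap<>();
        try (BufferedReader r = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = r.readLine()) != null) {
                String trimmed = line.trim();
                if (trimmed.isEmpty()) {
                    continue; // skip blank lines
                }
                String key = trimmed.split("\\s+", 2)[0];
                BufferedWriter w = writers.get(key);
                if (w == null) {
                    // First line for this key: open its output file.
                    Path p = outDir.resolve(key + ".txt");
                    w = Files.newBufferedWriter(p, StandardCharsets.UTF_8);
                    writers.put(key, w);
                    files.put(key, p);
                }
                w.write(line);
                w.newLine();
            }
        } finally {
            // At most one writer per distinct key stays open (about 50 here).
            for (BufferedWriter w : writers.values()) {
                w.close();
            }
        }
        return files;
    }
}
```

Splitting this way keeps one writer open per distinct key, which is unproblematic for about 50 keys but would not scale to a very large number of keys.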