
Is it possible to read a text file with a double while/for loop?

I'd like to do something like this:

for( int row1 = 0; row1 < file.length; row1++ ) {

    for( int row2 = row1 + 1; row2 < file.length; row2++ ) {

        if( file[row1] == file[row2] ) {
            // other code
        }

    }

}

I need a double loop because I have to find duplicate rows in a file with 2,500,000 rows. I can't use a Set to store the rows because the heap size is insufficient, and if I try to increase it I get this error: "Error occurred during initialization of VM. Could not reserve enough space for object heap. Could not create the Java virtual machine." (I'm on Windows 7 64-bit with 8 GB of RAM.)

Thanks in advance

Webman

3 Answers


Sort the original file first (you can split it up and use an external merge sort). Then find the duplicates in a single pass: if the previous line equals the current line, you've found a duplicate.
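
Roughly, that second pass could look like this in Java (just a sketch; the sorted-input and output file names are placeholders):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class DedupSortedFile {
    public static void main(String[] args) throws IOException {
        // Assumes "sorted.txt" has already been sorted, so duplicate rows are adjacent.
        try (BufferedReader in = new BufferedReader(new FileReader("sorted.txt"));
             BufferedWriter out = new BufferedWriter(new FileWriter("unique.txt"))) {
            String prev = null;
            String cur;
            while ((cur = in.readLine()) != null) {
                if (!cur.equals(prev)) {   // compare contents, not references
                    out.write(cur);
                    out.newLine();
                }
                prev = cur;
            }
        }
    }
}

Only the previous and current lines are ever held in memory, so the heap size stops being an issue.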

Moishe Lettvin
  • But in this way the heap size problem would remain... or am I wrong? – Webman Oct 25 '11 at 15:46
  • @Webman No, that would solve the heap size issue, as you wouldn't be keeping references to the data once it is written to disk. The garbage collector would be able to do its thing. I've added another answer that explains in greater detail and has some links to implementation details and pseudocode for you. – Jeff Ferland Oct 25 '11 at 16:25

Based upon your question and the comments following it, your goal is to find duplicates in a large file. Worst-case for this is O(N^2) -- comparing every object to every other object. The better solution is to sort them first.

Because the file is too large for you to allocate enough memory to sort it in memory, you need a different approach. How could the UNIX sort command sort a very large file? provides some details of an implementation. The generic problem is "external sorting".

The pseudo-code from the Wikipedia page should be easy enough to follow and implement. If you're feeling really brave, you can tackle the algorithmic details from the Unix sort command and the corresponding pages of the Knuth book.

... and finally, a web search will turn up ready-made external sort code, though I haven't really reviewed or tested any of it.
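
For a concrete picture of the approach, here is a minimal split-sort-merge sketch in Java. The chunk size, the file names, and the small Entry helper are illustrative assumptions, not a reference implementation. Since the merged stream is globally sorted, the merge can also drop duplicates as it writes:

import java.io.*;
import java.nio.file.*;
import java.util.*;

public class ExternalSortDedup {

    // Illustrative chunk size; tune it to what comfortably fits in your heap.
    private static final int CHUNK_LINES = 200_000;

    public static void main(String[] args) throws IOException {
        List<Path> chunks = splitIntoSortedChunks(Paths.get("input.txt"));
        mergeDroppingDuplicates(chunks, Paths.get("output.txt"));
    }

    // Phase 1: read the big file a chunk at a time, sort each chunk in memory,
    // and spill it to a temporary file. Only one chunk is ever held in memory.
    static List<Path> splitIntoSortedChunks(Path input) throws IOException {
        List<Path> chunkFiles = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input)) {
            List<String> buffer = new ArrayList<>(CHUNK_LINES);
            String line;
            while ((line = in.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() >= CHUNK_LINES) {
                    chunkFiles.add(writeSortedChunk(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                chunkFiles.add(writeSortedChunk(buffer));
            }
        }
        return chunkFiles;
    }

    static Path writeSortedChunk(List<String> lines) throws IOException {
        Collections.sort(lines);
        Path chunk = Files.createTempFile("chunk", ".txt");
        Files.write(chunk, lines);
        return chunk;
    }

    // Phase 2: k-way merge of the sorted chunks; writing a line only when it
    // differs from the previously written one removes the duplicates on the fly.
    static void mergeDroppingDuplicates(List<Path> chunks, Path output) throws IOException {
        PriorityQueue<Entry> heap = new PriorityQueue<>(Comparator.comparing((Entry e) -> e.line));
        List<BufferedReader> readers = new ArrayList<>();
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            for (Path chunk : chunks) {
                BufferedReader reader = Files.newBufferedReader(chunk);
                readers.add(reader);
                String first = reader.readLine();
                if (first != null) {
                    heap.add(new Entry(first, reader));
                }
            }
            String prev = null;
            while (!heap.isEmpty()) {
                Entry smallest = heap.poll();
                if (!smallest.line.equals(prev)) {   // write each distinct row once
                    out.write(smallest.line);
                    out.newLine();
                    prev = smallest.line;
                }
                String next = smallest.reader.readLine();
                if (next != null) {
                    heap.add(new Entry(next, smallest.reader));
                }
            }
        } finally {
            for (BufferedReader reader : readers) {
                reader.close();
            }
        }
    }

    // Small holder pairing a line with the reader it came from.
    static final class Entry {
        final String line;
        final BufferedReader reader;
        Entry(String line, BufferedReader reader) {
            this.line = line;
            this.reader = reader;
        }
    }
}

With 2,500,000 rows this keeps at most one chunk's worth of lines in memory at a time, plus one buffered reader per chunk during the merge.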

Jeff Ferland

You can do that, but the performance is O(n²), which isn't too good. Also, beware of using ==: it checks whether the two references point to the same object, which is not the same as using equals. Maybe you can calculate a hash for each row and use that to sniff out possible collisions.
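
A minimal sketch of that hash idea, with placeholder file names: the first pass counts how often each row's int hash occurs (a boxed HashMap is used here for brevity; a primitive-int map would be leaner), and the second pass only holds on to rows whose hash occurred more than once, so the equals() check is limited to the few potential collisions:

import java.io.*;
import java.util.*;

public class HashPrefilterDedup {
    public static void main(String[] args) throws IOException {
        // Pass 1: count occurrences of each row hash.
        Map<Integer, Integer> hashCounts = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader("input.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                hashCounts.merge(line.hashCode(), 1, Integer::sum);
            }
        }

        // Pass 2: rows with a unique hash are provably unique; only rows whose
        // hash occurred more than once need to be remembered for an equals() check.
        Set<String> seenCandidates = new HashSet<>();
        try (BufferedReader in = new BufferedReader(new FileReader("input.txt"));
             BufferedWriter out = new BufferedWriter(new FileWriter("unique.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (hashCounts.get(line.hashCode()) == 1) {
                    out.write(line);          // unique hash, so the row is unique
                    out.newLine();
                } else if (seenCandidates.add(line)) {
                    out.write(line);          // first time this exact row is seen
                    out.newLine();
                }
            }
        }
    }
}

This trades a second read of the file for a much smaller in-memory set than storing every row.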

G_H
  • The performance is not important: I just want to remove the duplicate rows to get a new file. – Webman Oct 25 '11 at 15:35
  • Then I reckon Moishe's solution will work well. You can parse over the file, output to two files of half the size, and keep doing this recursively a couple of times. Then start a merge sort from those smaller files back into larger files. Lots of IO, slow, but memory usage can be kept minimal. – G_H Oct 25 '11 at 15:44