
Is it possible to read a text file with a double while/for loop?

I'd like to do something like this:

for( int row1 = 0; row1 < file.length; row1++ ) {

    for( int row2 = row1 + 1; row2 < file.length; row2++ ) {

        if( file[row1] == file[row2] ) {
            // other code
        }

    }

}

I need a double loop because I have to find duplicate rows in a file with 2,500,000 rows. I can't use a Set to store the rows because the heap size is insufficient, and if I try to increase it I get this error: "Error occurred during initialization of VM. Could not reserve enough space for object heap. Could not create the Java virtual machine." (I'm on Windows 7 64-bit with 8 GB of RAM.)

Thanks in advance

Webman

3 Answers


Sort the original file first (you can split it up and use an external merge sort). Then find the duplicates in a single pass: if the previous line equals the current line, you've found a duplicate.
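
Roughly, that second pass could look like this in Java (just a sketch; the sorted-input and output file names are placeholders):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class DedupSortedFile {
    public static void main(String[] args) throws IOException {
        // Assumes "sorted.txt" has already been sorted, so duplicate rows are adjacent.
        try (BufferedReader in = new BufferedReader(new FileReader("sorted.txt"));
             BufferedWriter out = new BufferedWriter(new FileWriter("unique.txt"))) {
            String prev = null;
            String cur;
            while ((cur = in.readLine()) != null) {
                if (!cur.equals(prev)) {   // compare contents, not references
                    out.write(cur);
                    out.newLine();
                }
                prev = cur;
            }
        }
    }
}

Only the previous and current lines are ever held in memory, so the heap size stops being an issue.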

Moishe Lettvin
  • But in this way the heap size problem would remain... or am I wrong? – Webman Oct 25 '11 at 15:46
  • @Webman No, that would solve the heap size issue, as you wouldn't be keeping references to the data once it is written to disk. The garbage collector would be able to do its thing. I've added another answer that explains in greater detail and has some links to implementation details and pseudocode for you. – Jeff Ferland Oct 25 '11 at 16:25

Based upon your question and the comments following it, your goal is to find duplicates in a large file. Worst-case for this is O(N^2) -- comparing every object to every other object. The better solution is to sort them first.

Because the file is too large for you to allocate enough memory to sort it in memory, you need a different approach. How could the UNIX sort command sort a very large file? provides some details of an implementation. The generic problem is "external sorting".

The pseudo-code from the Wikipedia page should be easy enough to follow and implement. If you're feeling really brave, you can tackle the algorithmic details from the Unix sort command and the corresponding pages of the Knuth book.

... and finally, a web search will turn up ready-made external sort code, though I haven't really reviewed or tested any of it.
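
For a concrete picture of the approach, here is a minimal split-sort-merge sketch in Java. The chunk size, the file names, and the small Entry helper are illustrative assumptions, not a reference implementation. Since the merged stream is globally sorted, the merge can also drop duplicates as it writes:

import java.io.*;
import java.nio.file.*;
import java.util.*;

public class ExternalSortDedup {

    // Illustrative chunk size; tune it to what comfortably fits in your heap.
    private static final int CHUNK_LINES = 200_000;

    public static void main(String[] args) throws IOException {
        List<Path> chunks = splitIntoSortedChunks(Paths.get("input.txt"));
        mergeDroppingDuplicates(chunks, Paths.get("output.txt"));
    }

    // Phase 1: read the big file a chunk at a time, sort each chunk in memory,
    // and spill it to a temporary file. Only one chunk is ever held in memory.
    static List<Path> splitIntoSortedChunks(Path input) throws IOException {
        List<Path> chunkFiles = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input)) {
            List<String> buffer = new ArrayList<>(CHUNK_LINES);
            String line;
            while ((line = in.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() >= CHUNK_LINES) {
                    chunkFiles.add(writeSortedChunk(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                chunkFiles.add(writeSortedChunk(buffer));
            }
        }
        return chunkFiles;
    }

    static Path writeSortedChunk(List<String> lines) throws IOException {
        Collections.sort(lines);
        Path chunk = Files.createTempFile("chunk", ".txt");
        Files.write(chunk, lines);
        return chunk;
    }

    // Phase 2: k-way merge of the sorted chunks; writing a line only when it
    // differs from the previously written one removes the duplicates on the fly.
    static void mergeDroppingDuplicates(List<Path> chunks, Path output) throws IOException {
        PriorityQueue<Entry> heap = new PriorityQueue<>(Comparator.comparing((Entry e) -> e.line));
        List<BufferedReader> readers = new ArrayList<>();
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            for (Path chunk : chunks) {
                BufferedReader reader = Files.newBufferedReader(chunk);
                readers.add(reader);
                String first = reader.readLine();
                if (first != null) {
                    heap.add(new Entry(first, reader));
                }
            }
            String prev = null;
            while (!heap.isEmpty()) {
                Entry smallest = heap.poll();
                if (!smallest.line.equals(prev)) {   // write each distinct row once
                    out.write(smallest.line);
                    out.newLine();
                    prev = smallest.line;
                }
                String next = smallest.reader.readLine();
                if (next != null) {
                    heap.add(new Entry(next, smallest.reader));
                }
            }
        } finally {
            for (BufferedReader reader : readers) {
                reader.close();
            }
        }
    }

    // Small holder pairing a line with the reader it came from.
    static final class Entry {
        final String line;
        final BufferedReader reader;
        Entry(String line, BufferedReader reader) {
            this.line = line;
            this.reader = reader;
        }
    }
}

With 2,500,000 rows this keeps at most one chunk's worth of lines in memory at a time, plus one buffered reader per chunk during the merge.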

Jeff Ferland

You can do that, but the performance is O(n²), which isn't too good. Also, beware of using ==: it checks whether the two references point to the same object, which is not the same as using equals. Maybe you can calculate a hash for each row and use that to sniff out possible collisions.
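
A minimal sketch of that hash idea, with placeholder file names: the first pass counts how often each row's int hash occurs (a boxed HashMap is used here for brevity; a primitive-int map would be leaner), and the second pass only holds on to rows whose hash occurred more than once, so the equals() check is limited to the few potential collisions:

import java.io.*;
import java.util.*;

public class HashPrefilterDedup {
    public static void main(String[] args) throws IOException {
        // Pass 1: count occurrences of each row hash.
        Map<Integer, Integer> hashCounts = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader("input.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                hashCounts.merge(line.hashCode(), 1, Integer::sum);
            }
        }

        // Pass 2: rows with a unique hash are provably unique; only rows whose
        // hash occurred more than once need to be remembered for an equals() check.
        Set<String> seenCandidates = new HashSet<>();
        try (BufferedReader in = new BufferedReader(new FileReader("input.txt"));
             BufferedWriter out = new BufferedWriter(new FileWriter("unique.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (hashCounts.get(line.hashCode()) == 1) {
                    out.write(line);          // unique hash, so the row is unique
                    out.newLine();
                } else if (seenCandidates.add(line)) {
                    out.write(line);          // first time this exact row is seen
                    out.newLine();
                }
            }
        }
    }
}

This trades a second read of the file for a much smaller in-memory set than storing every row.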

G_H
  • The performance is not important: I just want to remove the duplicate rows to get a new file. – Webman Oct 25 '11 at 15:35
  • Then I reckon Moishe's solution will work well. You can parse over the file, output to two files of half the size, and keep doing this recursively a couple of times. Then start a merge sort from those smaller files back into larger files. Lots of IO, slow, but memory usage can be kept minimal. – G_H Oct 25 '11 at 15:44