
I have a PC with 4 GB of RAM and a 10 GB file. Now I want to check whether each line in the file is unique, so I have written the following code:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class Cleaner {

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("Too few parameters!");
            return;
        }

        File file = new File(args[0]);
        BufferedReader buff = new BufferedReader(new FileReader(file));
        String line;
        Set<String> set = new HashSet<String>();
        while ((line = buff.readLine()) != null) {
            set.add(line);
        }
        FileWriter fw = new FileWriter(args[1]);
        for (String s : set) {
            fw.write(s + "\n");
            fw.flush();
        }
        fw.close();
        buff.close();

    }

}

But I get an OutOfMemoryError, so my question is:
How should I change my code to get a file where each line is unique?
Thank you in advance for your help.

Leonid Glanz
  • Split into chunks and compare pairwise. Or hash each line and just store the hash together with the line. – user Nov 04 '15 at 12:13
  • The problem with hashing is that each line is already only a hash, and if I split the file into chunks I could miss some duplicated lines. – Leonid Glanz Nov 04 '15 at 12:17
  • Take a look at RandomAccessFile: you could read line 1 from RandomAccessFile 'a' and compare it to all other lines of RandomAccessFile 'b'. After that, read line 2, and so on. – user Nov 04 '15 at 12:22
  • @LeonidGlanz How many lines in the file (approximately)? – assylias Nov 04 '15 at 12:23
  • what is the format of your (hash) lines? – wero Nov 04 '15 at 12:23
  • I'd go for `cat bigfile.txt | sort | uniq > uniq.txt`, but of course that's not a Java solution. – Kayaman Nov 04 '15 at 12:26
  • I have about 85 million lines, and the format is a sha256 hash followed by a line feed. – Leonid Glanz Nov 04 '15 at 12:26
  • @LeonidGlanz this may help http://stackoverflow.com/questions/9215820/find-duplicates-in-large-file – wero Nov 04 '15 at 12:33
  • You can't keep the results in the Set because the content you're dealing with is quite big. You're better off using another file and appending to it rather than using the Set. – MeTitus Nov 04 '15 at 12:44

3 Answers


You cannot do that operation this way because of your limited RAM. Instead, you can read the file and generate n files of a fixed size (e.g. 10,000 lines each): read a line and put it in the current chunk file, and when you reach the limit, open a new one and release the old objects to save memory. Then do a second loop and compare each line of the original file, held as a single String, against the n generated files. Maybe this way you can avoid running out of memory.

It is a little odd and will be a slow process, but I think you can meet your requirements this way.

If you need code, let me know.

Hope this helps.
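
Below is a minimal sketch of how the chunking could look, assuming lines are partitioned by hashCode() so that identical lines always land in the same chunk file (a variant of the idea above that lets each chunk be deduplicated independently instead of comparing every line against every chunk). The chunk count and the temporary file names are placeholders, not part of the original answer's description:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class ChunkedCleaner {

    public static void main(String[] args) throws IOException {
        Path input = Paths.get(args[0]);
        Path output = Paths.get(args[1]);
        int chunkCount = 64; // placeholder: pick so that one chunk fits comfortably in memory

        // Pass 1: partition lines into chunk files by hash, so duplicates end up in the same chunk.
        BufferedWriter[] chunkWriters = new BufferedWriter[chunkCount];
        for (int i = 0; i < chunkCount; i++) {
            chunkWriters[i] = Files.newBufferedWriter(Paths.get("chunk_" + i + ".tmp"), StandardCharsets.UTF_8);
        }
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int bucket = Math.floorMod(line.hashCode(), chunkCount);
                chunkWriters[bucket].write(line);
                chunkWriters[bucket].newLine();
            }
        }
        for (BufferedWriter writer : chunkWriters) {
            writer.close();
        }

        // Pass 2: deduplicate each chunk in memory and append its unique lines to the output.
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (int i = 0; i < chunkCount; i++) {
                Path chunk = Paths.get("chunk_" + i + ".tmp");
                Set<String> seen = new HashSet<>();
                try (BufferedReader reader = Files.newBufferedReader(chunk, StandardCharsets.UTF_8)) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        if (seen.add(line)) { // add() returns false for lines already written
                            out.write(line);
                            out.newLine();
                        }
                    }
                }
                Files.delete(chunk);
            }
        }
    }
}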

Francisco Hernandez

You could try to look for duplicate line hashes first to identify potential duplicate lines:

Map<Integer, Integer> hashes = new HashMap<>(); // line hashCode -> index of first line with that hash
Map<Integer, Integer> dupes = new HashMap<>();  // line index -> earlier line index with the same hash
int i = 0;
String line;
while ((line = buff.readLine()) != null) {
  int hash = line.hashCode();
  Integer previous = hashes.get(hash);
  if (previous != null) { // potential duplicate: same hash as an earlier line
    dupes.put(i, previous);
  } else {
    hashes.put(hash, i);
  }
  ++i;
}

At the end you have a list of potential duplicates. If dupes is empty, there were no duplicates; if it is not, you can do a second pass over the file to check whether the lines are really identical.
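
For that second pass, here is a minimal sketch of the byte-offset variant discussed in the comments below: instead of storing line numbers, it stores the byte position of the first line carrying each hash, so candidate lines can be re-read later with RandomAccessFile (a primitive map, as suggested in the comments, would cut memory further). The class name and the offset-pair representation are placeholders:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DuplicateChecker {

    public static void main(String[] args) throws IOException {
        Map<Integer, Long> firstOffsetByHash = new HashMap<>(); // line hashCode -> byte offset of first line with that hash
        List<long[]> candidates = new ArrayList<>();            // {offset of later line, offset of earlier line}

        // Pass 1: record byte offsets of lines whose hash was already seen.
        // RandomAccessFile.readLine() decodes bytes as ISO-8859-1, which is fine for ASCII hex lines.
        try (RandomAccessFile raf = new RandomAccessFile(args[0], "r")) {
            long offset = raf.getFilePointer();
            String line;
            while ((line = raf.readLine()) != null) {
                Integer hash = line.hashCode();
                Long previous = firstOffsetByHash.get(hash);
                if (previous != null) {
                    candidates.add(new long[] { offset, previous });
                } else {
                    firstOffsetByHash.put(hash, offset);
                }
                offset = raf.getFilePointer();
            }
        }

        // Pass 2: seek back to each candidate pair and compare the actual lines.
        try (RandomAccessFile raf = new RandomAccessFile(args[0], "r")) {
            for (long[] pair : candidates) {
                raf.seek(pair[0]);
                String later = raf.readLine();
                raf.seek(pair[1]);
                String earlier = raf.readLine();
                if (later.equals(earlier)) {
                    System.out.println("Duplicate line at byte offset " + pair[0]);
                }
            }
        }
    }
}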

assylias
  • It would also be much more memory-efficient to use Koloboke [`IntIntMap`](http://openhft.github.io/Koloboke/api/0.6/java7/net/openhft/koloboke/collect/map/IntIntMap.html) or Trove [`TIntIntHashMap`](http://trove4j.sourceforge.net/javadocs/gnu/trove/map/hash/TIntIntHashMap.html) to represent the map. – Tagir Valeev Nov 04 '15 at 12:38
  • About that second pass, you can skip it if you switch to random access enabled files. Then you can just scroll back and check the line on each potential duplicate. Actually, second pass would be impossible without random access anyway. – bezmax Nov 04 '15 at 13:06
  • @bezmax Random access can't help you go to line xyz - you can only skip a number of bytes - in my example I could store the byte position instead of the line number. – assylias Nov 04 '15 at 13:35
  • @assylias That's what I implied. I do not believe there is a way to make it work with line numbers as in your example. During the second pass you would have to compare the duplicates between themselves, so you would need to load them all to memory or use random access. Loading to memory would not work either - for the worst case scenario (all lines duplicated) you would need to load full file to memory. – bezmax Nov 04 '15 at 13:46

You can cheat with something like this (the example is Groovy, but the equivalent Java would work):

import org.apache.commons.codec.digest.DigestUtils // Commons Codec digest library

def hashes = [] as Set // a Set keeps the 'in' lookup O(1); a plain list would scan every stored hash
def writer = new PrintWriter(new FileWriter("out.txt"))
new File('test.txt').eachLine { line ->
    def hashCode = DigestUtils.sha256Hex(line)
    if (!(hashCode in hashes)) {
        hashes << hashCode    // remember the digest
        writer.println(line)  // first time this line is seen, keep it
    }
}
writer.close()

That shouldn't require more than about 1 GB of RAM to run. SHA-256 hashes will probably give you a lot more certainty about the uniqueness of a line than the standard hashCode method.
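
For reference, a rough Java equivalent of the Groovy snippet above, assuming Commons Codec's DigestUtils is on the classpath and the same hard-coded file names; it keeps the digests in a HashSet so the membership check stays O(1):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.HashSet;
import java.util.Set;

import org.apache.commons.codec.digest.DigestUtils;

public class HashFilter {

    public static void main(String[] args) throws IOException {
        Set<String> hashes = new HashSet<>(); // SHA-256 hex digests of lines already written
        try (BufferedReader reader = new BufferedReader(new FileReader("test.txt"));
             PrintWriter writer = new PrintWriter(new FileWriter("out.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String digest = DigestUtils.sha256Hex(line);
                if (hashes.add(digest)) { // add() returns false if this digest was seen before
                    writer.println(line);
                }
            }
        }
    }
}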

Mike Thomsen
  • It is supposed to detect hash collisions because he wanted a file that only has unique lines. Let me guess, you're the guy who downvoted this, aren't you... – Mike Thomsen Nov 05 '15 at 11:41