
I want to remove the duplicate lines from a text file. To do that, I put every line into a HashSet and then write the set to another file. That works fine, but when it comes to big files (180 MB, 5 million lines) it doesn't work very well. Since I assumed it is not possible to store 5 million strings in a HashSet or any other collection, I made a loop: I store the first 100,000 lines, write them to the file, clear the HashSet and repeat until there are no more lines in the input. This won't remove all the duplicates, but I figured it should remove about 70-90% of them. Yet it doesn't work as expected: when I test it with the 180 MB file of 5 million lines, I count about 300,000 duplicates, but the new file has only about 3 million lines instead of roughly 5 million minus 300,000. And when I count the iterations, they should be 5 million, but they are 3.4 million.

    public File removeDuplicates(File file) {
        System.out.println("file opened");
        Scanner sc;
        HashSet<String> set = new HashSet<String>();
        JFileChooser chooser = new JFileChooser();
        File createdFile = null;
        int returnVal = chooser.showSaveDialog(parent);
        if (returnVal == JFileChooser.APPROVE_OPTION) {
            BufferedWriter bufferedWriter = null;
            createdFile = chooser.getSelectedFile();
            try {
                if (!createdFile.exists()) {
                    createdFile.createNewFile();
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        try {
            sc = new Scanner(file);
            boolean hasMore = true;
            while (hasMore) {
                hasMore = false;
                while (sc.hasNextLine() && set.size() < PERIOD) {
                    set.add(sc.nextLine());
                    repeated++;
                }
                createdFile = this.writeToFile(set, createdFile);
                set.clear();
                hasMore = true;
                if (sc.hasNextLine() == false)
                    hasMore = false;
                set.clear();
            }
        } catch (FileNotFoundException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        return createdFile;
    }
    private File writeToFile(HashSet<String> set, File f) {
        BufferedWriter bufferedWriter = null;
        try {
            Writer writer = new FileWriter(f, true);
            bufferedWriter = new BufferedWriter(writer);
            for (String str : set) {
                bufferedWriter.write(str);
                bufferedWriter.newLine();
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (bufferedWriter != null)
                try {
                    bufferedWriter.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
        }
        return f;
    }

repeated is the variable that counts the iterations. Is the problem something in my code, or does it come from the RAM consumption? And is there any way to make this work?

Dexxrey
    What does _But when it comes to big files (180mb 5 million lines) it doesn't work very well_ mean? What is the specific problem? – Andrew S Sep 13 '18 at 13:49
  • If you have enough RAM, there shouldn't be any problem with 5 million lines or 180 mb... Besides, you aren't telling whether each line contains one word or many words – fps Sep 13 '18 at 13:51
  • With the small files it works perfectly. With 100 000 lines file for example. – Dexxrey Sep 13 '18 at 13:57
  • For example...what? – VLAZ Sep 13 '18 at 13:59
  • Clearly, if you call `set.clear()` and then keep reading lines, you have no way to know if the subsequent lines are duplicates of lines that were in `set` before you cleared it. – VGR Sep 13 '18 at 14:00
  • Let's call the 180 mb file A.txt and the result B.txt. When I run it with file A, it creates file B with 3 million records in it. And because there is no other way to check whether it is working, I counted the times the set size didn't change, which is about 300k times. So the lines of file A should be equal to the lines of file B + 300k, but it is not even close to that. Then I started to count the iterations. With file A the iterations should be equal to the number of lines in file A, but it is about 3.4 million iterations, not 5 million as it should be. – Dexxrey Sep 13 '18 at 14:03
  • `Assuming the fact that it is not possible to store 5 million strings in a HashSet or any other collection` wait, where does that assumption come from? [It seems to be wrong](https://stackoverflow.com/questions/7632126/maximum-size-of-hashset-vector-linkedlist). You are more likely limited by memory size (and JVM memory limits) than the actual set implementation. Even then, there are likely to be implementations that handle even more elements...as long as you can spare the memory. – VLAZ Sep 13 '18 at 14:04
  • With that said, it could be relatively simple to drastically reduce memory consumption - read a line from file1, check if it's in file2, if not -> write it to file2. That will do a lot more disk reads, though, but it will avoid gobbling up the RAM if you can't spare it. – VLAZ Sep 13 '18 at 14:06
  • Yeah, but I can't store 5 million strings with an average length of 30 in a HashSet. I tried, but when I run it, it hangs and does nothing. – Dexxrey Sep 13 '18 at 14:09
  • According to answer to the other question, you should be able to store more items. By a factor of about 400. Maybe even more. To be honest, I've not tried it but it seems sound to me that a Set should be able to accommodate 5 million elements. Again, unless memory is the limit. As per the accepted answer in the other question, the documentation for `Collection` does mention that it's possible for a size to be `Integer.MAX_VALUE` or even *more*. That could probably vary for different implementations but it still means that your assumption doesn't seem correct. – VLAZ Sep 13 '18 at 14:17
  • Before we get hung up on the details of HashSets - is this an XY Problem? (https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem/66378#66378) - i.e., do you want to deduplicate a text file by any means, or do you have to write this Java program? Because either way, I have some solutions for ya buddy. – Benjamin Maurer Sep 13 '18 at 14:20
  • 180mb really isn't that much even when talking about virtual memory. I suggest you debug the reason why your script hangs with the full file (you will probably want to check the GC logs first to see if it's running full-time, and in that case increase your JVM's XMX). You could always sacrifice CPU to win memory by storing hashes of the lines rather than their whole content, but I doubt you will want to sacrifice the time it would take to read an entire file for each record as @vlaz suggested at one point. – Aaron Sep 13 '18 at 14:22
  • I don't know the Java HashSet implementation that well, but usually that works by using a fast hash function and then adding elements to buckets on collision, i.e., there may be multiple elements per bucket if there are collisions, in which case you will degrade to a linear search. That's usually no problem, but if you have 5 million strings and many collisions, you might degrade to a linear search over 1 million strings, comparing each for equality. – Benjamin Maurer Sep 13 '18 at 14:26

1 Answer


De-duplicate

Let's assume for a moment, that you simply want to de-duplicate that file. I'd say the fastest, no-hassle method would be good old unix utils:

cat myfile.txt | sort -u > sorted.txt

Improving your Solution

(TL;DR Increase JVM Heap Size, initialize HashSet size and use the last solution in this answer!)

In case you need to do this in Java, let's first try to make it more efficient. As many people have mentioned, 180MB isn't all that much. Just load the whole file; there is no need to chunk it (and, as you noticed, chunking means you will not eliminate all duplicates). Take this line for example:

HashSet<String> set = new HashSet<String>();

This will create a HashSet with the default initial capacity (16 elements) and a load factor of 0.75, meaning that as you add lines, it will repeatedly have to re-allocate memory and copy everything over. Here is something useful to read, especially "Performance".

So let's increase that size to avoid allocation:

Set<String> set = new HashSet<String>(5000000);

I left the load factor as is, but that means it will reallocate once it's 75% full. If you know the size of your file for sure, you can adjust those settings.
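
For example, here is a minimal sketch of that sizing (expectedLines is just an assumed estimate of the input file's line count): the set rehashes once its size exceeds capacity times load factor, so choosing a capacity of at least expectedLines / loadFactor avoids rehashing entirely.

int expectedLines = 5000000;      // assumed: rough line count of the input file
float loadFactor = 0.75f;         // the default load factor
// Rehashing happens once size > capacity * loadFactor,
// so this capacity is large enough to never rehash.
int initialCapacity = (int) Math.ceil(expectedLines / loadFactor);
Set<String> set = new HashSet<String>(initialCapacity, loadFactor);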

Alright, I had to learn it the hard way - always measure first! That is rule number one of performance work. I wrote all that and then tested my own implementation on my fast workstation (with 16GB RAM and a fast multi-core CPU) and summed all that up in my edit. Now I was curious to try your solution (which I should have done right away). So I re-ran it on my notebook at home (8GB RAM, 4+ year old CPU).

Alright, here is the simplified code:

import java.io.*;
import java.util.*;

public class SortTest {

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Pass filename as argument!");
            System.exit(1);
        }

        Set<String> set = new HashSet<String>();
        File createdFile = new File("./outfile");
        createdFile.createNewFile();

        try (BufferedReader br = new BufferedReader(new FileReader(new File(args[0])))) {
            for (String line = br.readLine(); line != null; line = br.readLine()) {
                set.add(line);
            }
        } catch (IOException ex) {
            throw new RuntimeException("Fatal Error.",  ex);
        }

        try (BufferedWriter bufferedWriter = new BufferedWriter(new FileWriter(createdFile, true))) {
            for (String line : set) {
                bufferedWriter.write(line);
                bufferedWriter.newLine();
            }
        }
    }
}

Changes: I removed the chunking and load the whole file at once. I'm using a BufferedReader, because a Scanner is more useful for parsing (reading integers etc.) and might incur overhead. I also moved the writing of the file to the end, so I don't need to recreate the BufferedWriter every time. Also note that File.createNewFile() will only create a file if it doesn't exist, and it returns whether it did, so your exists() check is superfluous. (Note that I have omitted proper error handling for brevity.)

I used name.basics from https://datasets.imdbws.com/. That is a 509 MB file (unzipped) containing 8,837,960 lines. Those lines are actually all unique, so the end result is the same.

It actually consumed a heck of a lot of resources and my system became rather slow. At first, I even got an OutOfMemoryError! But running it with more heap space worked. time java -Xmx4g SortTest ./name.basics.tsv gives me:

real 0m44.289s

user 1m23.128s

sys 0m2.856s

So around 44 seconds, not bad. Now let's avoid the reallocations and set:

Set<String> set = new HashSet<String>(9000000, 0.9f);

Result:

real 0m38.443s

user 1m12.140s

sys 0m2.376s

Well, that looks better. I have to say, though, that I re-ran those tests multiple times and the times can vary by up to 5 seconds, so in reality the results are very close.

Just for fun, I'll also show my own little implementation, which uses more modern and succinct Java (again, no proper error handling):

import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

public class SortTest2 {

    public static void main(String[] args) throws Exception {
        Set<String> uniq = new HashSet<>(100000, 0.9f);
        try (Stream<String> stream = Files.lines(Paths.get(args[0]))) {
            stream.forEach(uniq::add);
        }

        Files.write(Paths.get("./outfile2"), (Iterable<String>) uniq::iterator);
    }
}

Results:

real 0m38.321s

user 1m16.452s

sys 0m2.828s

Less code, but the result is pretty much the same. Note: if you replace the HashSet with a LinkedHashSet, it will preserve the ordering of your lines! This is a good example of why you should declare your variables and arguments with the most generic type possible. If you use Set<String> uniq, you only have to change that one line to switch implementations (HashSet vs. LinkedHashSet).
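
For example (just a sketch, mirroring the sizing used earlier), only the construction line changes:

// Same API as HashSet, but iteration follows insertion order,
// so the output file keeps the original line order.
Set<String> uniq = new LinkedHashSet<String>(9000000, 0.9f);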

I actually wanted to have a look at it with a profiler, but the run time was so short, I didn't even get results before the program terminated.

If the file fits in your RAM and you use the appropriate max heap argument (-Xmx), it shouldn't be a problem.

By the way: I re-tested the cat | sort -u version - it took 55 seconds!

Note: heavily edited post after more testing

EDIT

Following user DodgyCodeException's suggestion, I removed the superfluous .stream() call in the second version.

OK, this is the best solution™ - I would say it was a collaborative effort, thanks to users Hulk and vlaz.

import java.nio.file.*;
import java.util.stream.*;

public class SortTest3 {

    public static void main(String[] args) throws Exception {
        try (Stream<String> stream = Files.lines(Paths.get(args[0]))) {
            Files.write(Paths.get("./outfile3"), (Iterable<String>) stream.distinct()::iterator);
        }
    }
}

Not only is this solution very succinct (possibly too much so) and as fast as the other ones, but best of all it preserves the original line order. All thanks to .distinct().

Alternative Solutions

I think the above solution should suffice for most use-cases and is rather simple. But let's say you need to deal with a file, which does not fit into RAM, or you need to preserve line ordering. We can take the idea behind this solution and change it up a bit.

You read the file line by line, so you'll always have only one line in memory - let's say of average length m. You then need some identifier to store and compare later, preferably with a constant size k and k << m. So you need a hash function - not a fast one with many collisions, but a cryptographic hash function, which is more collision resistant (e.g., SHA-1, -2 or -3). But note: the more collision resistant the function, the larger the hash and the more computational work you need to put in.

  1. Read line
  2. Calculate hash
  3. Look for value in linked list:
    • if you find one larger, insert before
    • if you find one equal, discard line
  4. Write line to output file if not discarded

You'll need a linked list to keep insertion cheap (and that list has to grow). The list will be kept ordered by the insertion strategy and the output file will preserve order by writing the lines out immediately.

This would take up approximately n * k + m in space, but calculating the hash function will be computationally expensive.

Note that this does not deal with collisions. If you use a good hash function, you can just pretend they won't happen (as they are very unlikely). If it is critical, you might need to add another mechanism to confirm uniqueness, e.g., store the line number alongside the hash and fetch the previously seen line for comparison. You'll then need to find a scheme to store the lines with collided hashes.
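
If you want to experiment with this approach, here is a minimal, untested sketch of the idea. It assumes SHA-256 via MessageDigest, keeps the encoded hashes in a TreeSet (standing in for the hand-maintained sorted linked list), ignores collisions as discussed above, and the output file name is made up:

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.*;

public class HashDedup {

    public static void main(String[] args) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        // One fixed-size hash per unique line instead of the line itself (k << m).
        // A TreeSet keeps the hashes ordered, like the sorted list described above.
        Set<String> seenHashes = new TreeSet<>();

        try (BufferedReader reader = new BufferedReader(new FileReader(args[0]));
             BufferedWriter writer = new BufferedWriter(new FileWriter("./outfile4"))) {
            for (String line = reader.readLine(); line != null; line = reader.readLine()) {
                byte[] hash = digest.digest(line.getBytes(StandardCharsets.UTF_8));
                String key = Base64.getEncoder().encodeToString(hash);
                // add() returns false if the hash was already present -> duplicate line, skip it.
                if (seenHashes.add(key)) {
                    writer.write(line);
                    writer.newLine();
                }
            }
        }
    }
}

Since each unique line is written the moment its hash is first seen, the output keeps the input's line order; only the hashes are held in memory.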

Benjamin Maurer
  • `If you know the size of your file for sure, you can adjust those settings.` it should be possible to count the lines. You might need to go through the file once (discarding any text), just to see how many lines but it would pay off by being able to construct a more efficient Set. I'm not sure if there is an even easier way to count all lines. – VLAZ Sep 13 '18 at 14:55
  • I don't think that makes sense. That way you would have to read the file twice. A memory reallocation would be cheaper. If he knows the average line count, that'd be good enough. You have to keep in mind that on each reallocation, the size doubles. If you start out with 16 and have to go to 5m, that's still a lot of steps. Starting at 5m and needing 8m would only take one reallocation. – Benjamin Maurer Sep 13 '18 at 14:58
  • I think I should increase the JVM's XMX. I tried to load all the strings and the counter counts 3,410,156 iterations. They should be 5 million. So I have to increase the maximum amount of RAM to make that happen, because my program can't get more than 1.1 GB of RAM. – Dexxrey Sep 13 '18 at 15:04
  • Which counter? And aren't you jumping to conclusions here? Why do you think that is the cause? If you ran out of memory, you would get an OutOfMemoryError and everything would come crashing down. Running out of memory isn't a silent fizzle-out, it goes with a bang. Believe you me – Benjamin Maurer Sep 13 '18 at 15:16
  • nice answer! just observation: log2(5M) ~ 22.2, so not that many scaling ups. – diginoise Sep 13 '18 at 15:43
  • When I try your code, it throws an exception on this line: `lines.forEach(line -> uniq.add(line));` – Dexxrey Sep 13 '18 at 20:24
  • ...and what Exception? – Benjamin Maurer Sep 13 '18 at 21:46
  • `stream.forEach(uniq::add);` actually, I was wondering what would happen if you were to do `Files.lines(path).distinct()` and directly write the result. I imagine that under the hood it's probably going to do something similar but I am interested to see how the performance would compare against this solution. I'll probably test it myself later, as I am not able to right now. – VLAZ Sep 14 '18 at 06:18
  • Any reason for `uniq.stream()::iterator` rather than just `uniq::iterator`? Also, is the OP OK with re-ordering the lines (which is an unintended side-effect of using a HashSet) rather than just de-duplicating them? – DodgyCodeException Sep 14 '18 at 08:41
  • @Hulk, nice one, added that solution. @DodgyCodeException: Fixed it, thx. Well, OP came up with the HashSet solution, so I guess he is OK with it. Most solutions will destroy order though, like the one with `sort -u` or `sort | uniq`. Uniq compares adjacent lines, so it needs sorted input. It's a good point though and I have added a suggestion for an order preserving solution. – Benjamin Maurer Sep 14 '18 at 09:42
  • @DodgyCodeException well...to play the spec game - there is no requirement to PRESERVE the order :P But a good spot. Still, OP is currently using a `Set`, hence I assumed that's fine but it's possible that OP hasn't actually considered it, either. – VLAZ Sep 14 '18 at 11:04
  • @BenjaminMaurer you could actually preserve the order if you read a line, add it to a Set and try to write it immediately (or add it to some sort of buffer to save disk accesses). If you encounter a duplicate line, then skip it. This way you remove duplicates and still have the same order. – VLAZ Sep 14 '18 at 11:07
  • OK, so I tried your `SortTest2` solution and one [using .distinct()](https://pastebin.com/mjCzhFE5) - the results are about equal on my system - ~16s in each case. – VLAZ Sep 14 '18 at 11:44
  • Also, I think it is interesting that on the same system I tried `cat | sort -u > outfile` and `cat | sort | uniq > outfile` and the latter was consistently faster. `sort -u` takes roughly 20s and very rarely 19 (and more than .5s), while `sort | uniq` takes approximately 17s. It seems to be slightly slower than the Java program but at least comparable, while `sort -u` is (at least) 25% slower. – VLAZ Sep 14 '18 at 11:58
  • @vlaz, really good! 1) Set.add() tells you whether it added anything, so you could write the line if it did. 2) Interesting, I didn't know about .distinct(), still learning the Streams API 3) Interesting observation. Just a side note, AFAIK sort is doing some clever stuff and while in this case slightly slower, should work on much larger inputs. – Benjamin Maurer Sep 14 '18 at 12:34
  • It's also possible that `sort` and/or `uniq` use a different implementation on our systems. I just found it interesting thing to note. I wouldn't rely on the behavior I observed being uniform across any *NIX system. Also, ran this on [Linux running inside Windows](https://en.wikipedia.org/wiki/Windows_Subsystem_for_Linux) (it's the only Linux system I have available ATM), so there could be some things behaving differently than what it normally would be. – VLAZ Sep 14 '18 at 12:43
  • OK guys, we did it ^^ I added the version with .distinct() - that actually preserves order. I'm curious to see the actual implementation, but the Stream code is mostly a setup for states and I assume the actual work is done in the terminal operations. I still have to look for that, but I should really stop working on this ^^ – Benjamin Maurer Sep 14 '18 at 13:05
  • OK, without digging too deep into it, it seems like distinct() uses a LinkedHashSet: http://hg.openjdk.java.net/jdk9/dev/jdk/file/48dac9cf76fb/src/java.base/share/classes/java/util/stream/DistinctOps.java – Benjamin Maurer Sep 14 '18 at 13:16
  • Just verified it - switching out the HashSet for a LinkedHashSet achieves the same! – Benjamin Maurer Sep 14 '18 at 13:18