14

I have a really large file with approximately 15 million entries. Each line in the file contains a single string (call it a key).

I need to find the duplicate entries in the file using Java. I tried using a HashMap to detect the duplicates, but apparently that approach throws a "java.lang.OutOfMemoryError: Java heap space" error.

How can I solve this problem?

I think I could increase the heap space and try it, but I wanted to know if there are better, more efficient solutions that don't require tweaking the heap space.

Eduard Wirch
  • 9,785
  • 9
  • 61
  • 73
Maximus
  • 559
  • 1
  • 5
  • 19
  • 2
    Offtopic: How did you get 15 million entries in the first place? – Mob Feb 09 '12 at 17:38
  • The good way of working should be to not have duplicates. There shouldn't be a need for removing duplicates. – Martijn Courteaux Feb 09 '12 at 17:39
  • 7
    @Martijn Courteaux: You don't know what kind of data this is. For example, if you have a book and want to know which words are used in the book, there is no way to avoid duplicates like `the` in the first place. – DarkDust Feb 09 '12 at 17:45
  • 8
    @Martijn Courteaux - where do you work? Do you always get to demand that all inputs to the system are in a format that makes your life easier? I want to work there! – Peter Recore Feb 09 '12 at 19:44
  • @PeterRecore: I'm not working yet. I'm going to secondary school... :( But, yes, I understand. My cousin works in the IT sector, and the stories he tells are quite disappointing. There are a lot of noobs working in important companies, etc. – Martijn Courteaux Feb 09 '12 at 20:33
  • 9
    @Martijn Courteaux - ahh, so you are still young and optimistic :) It is not just "noobs" that are the problem though. The real world is messy. Part of our job is to overcome the messiness and produce something useful. Imagine if Google only indexed web pages with properly spelled English words and perfect grammar. Or if Ford manufactured a car that could only work in sunny weather on brand new roads. – Peter Recore Feb 10 '12 at 16:46

9 Answers

36

The key is that your data will not fit into memory. You can use external merge sort for this:

Partition your file into multiple smaller chunks that fit into memory. Sort each chunk and eliminate the duplicates (which are now neighboring elements).

Merge the chunks and eliminate the duplicates again while merging. Since you will have an n-way merge here, you can keep the next k elements from each chunk in memory; once the items for a chunk are depleted (they have been merged already), grab more from disk.
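
For illustration, here is a rough Java sketch of this approach. The chunk size of one million lines, the temp-file handling, and the choice to print each duplicated key once to standard output are assumptions of the sketch, not something the answer specifies:

import java.io.BufferedReader;
import java.io.Closeable;
import java.io.IOException;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class ExternalSortDedup {

    // Phase 1: read the big file in chunks that fit into memory, sort each chunk, write it out.
    static List<Path> writeSortedChunks(Path input, int linesPerChunk) throws IOException {
        List<Path> chunks = new ArrayList<>();
        List<String> buffer = new ArrayList<>(linesPerChunk);
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() == linesPerChunk) {
                    chunks.add(flushSortedChunk(buffer));
                }
            }
        }
        if (!buffer.isEmpty()) {
            chunks.add(flushSortedChunk(buffer));
        }
        return chunks;
    }

    private static Path flushSortedChunk(List<String> buffer) throws IOException {
        Collections.sort(buffer);
        Path chunk = Files.createTempFile("chunk", ".txt");
        Files.write(chunk, buffer, StandardCharsets.UTF_8);
        buffer.clear();
        return chunk;
    }

    // Phase 2: n-way merge of the sorted chunks. Equal keys arrive consecutively in the merged
    // order, so a key is a duplicate exactly when it equals the previously merged key.
    static void reportDuplicates(List<Path> chunks, PrintStream out) throws IOException {
        PriorityQueue<ChunkCursor> heap =
                new PriorityQueue<>(Comparator.comparing((ChunkCursor c) -> c.current));
        for (Path chunk : chunks) {
            ChunkCursor cursor = new ChunkCursor(Files.newBufferedReader(chunk, StandardCharsets.UTF_8));
            if (cursor.advance()) heap.add(cursor);
        }
        String previous = null;
        boolean reported = false;
        while (!heap.isEmpty()) {
            ChunkCursor cursor = heap.poll();
            if (cursor.current.equals(previous)) {
                if (!reported) {                    // print each duplicated key only once
                    out.println(cursor.current);
                    reported = true;
                }
            } else {
                previous = cursor.current;
                reported = false;
            }
            if (cursor.advance()) heap.add(cursor); else cursor.close();
        }
    }

    private static final class ChunkCursor implements Closeable {
        final BufferedReader reader;
        String current;
        ChunkCursor(BufferedReader reader) { this.reader = reader; }
        boolean advance() throws IOException { current = reader.readLine(); return current != null; }
        @Override public void close() throws IOException { reader.close(); }
    }

    public static void main(String[] args) throws IOException {
        List<Path> chunks = writeSortedChunks(Paths.get(args[0]), 1_000_000);
        reportDuplicates(chunks, System.out);
    }
}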

BrokenGlass
  • 158,293
  • 28
  • 286
  • 335
  • Instead of fixed-size batches, keep reading until you've seen enough unique lines to grow your dictionary to a certain capacity, then write that out as a sorted batch for external merging. See my answer on http://stackoverflow.com/questions/32535222/memory-constrained-external-sorting-of-strings-with-duplicates-combinedcounted/32537772#32537772. Wanting only the duplicates means you can optimize the merging stage a lot, as soon as you've merged the number of batches down to the point where you can see *all* the sorted batches in one wide enough merge. – Peter Cordes Sep 14 '15 at 15:53
11

I'm not sure if you'd consider doing this outside of Java, but if so, this is very simple in a shell:

cat file | sort | uniq
Michael
  • 6,141
  • 2
  • 20
  • 21
  • @augurar: the OP asked about finding which entries are duplicated, not uniquified output. `sort file | uniq --repeated` (aka `uniq -d`). Michael: Never write `cat file | something`, that's just silly and a waste of CPU time and memory bandwidth compared to `something < file` (or `something file`, if that will give identical results). – Peter Cordes Sep 14 '15 at 15:58
  • @PeterCordes True, my comment just pertained to this answer. – augurar Sep 14 '15 at 20:12
7

You probably can't load the entire file at one time, but you can store each line's hash and line number in a map (effectively a multimap from hash to line numbers) without a problem.

Pseudocode:

for line in file
    entries.put(line.hashCode, line-number)
for entry in entries
    if entry.lineNumbers.size > 1
        fetch each of those lines by line number and compare them
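
For illustration, a rough Java rendering of this idea. It stores a count per hash code instead of line numbers (a plain text file can't be seeked to a line number, as a comment below notes) and re-reads the file sequentially in the second pass to compare the colliding lines exactly. A boxed `HashMap<Integer, Integer>` with millions of entries is still sizeable, so a primitive-int map may be needed in practice:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class HashFirstPass {
    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]);

        // Pass 1: count how many lines map to each hash code (much cheaper than keeping the strings).
        Map<Integer, Integer> hashCounts = new HashMap<>();
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                hashCounts.merge(line.hashCode(), 1, Integer::sum);
            }
        }

        // Pass 2: only lines whose hash occurred more than once can be duplicates;
        // compare those candidates exactly, and report each duplicated key once.
        Map<String, Integer> candidates = new HashMap<>();
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (hashCounts.get(line.hashCode()) > 1
                        && candidates.merge(line, 1, Integer::sum) == 2) {
                    System.out.println(line);
                }
            }
        }
    }
}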
Andrew White
  • 52,720
  • 19
  • 113
  • 137
  • 1
    Or store a dictionary of MD5 or SHA1 hashes of lines, and assume there won't be collisions for non-identical lines. When the count for that hash goes from 1 to 2, print the input line you just hashed. The output will be one copy of every line that was duplicated. If you do need to store line numbers for something, store byte offsets instead. Text files can't be random-accessed by line number, because they're variable length and there's no map. – Peter Cordes Sep 14 '15 at 16:02
  • If storing something for hashes of the keys, I'd choose something that allows fast access to the key, like an offset for `RandomAccessFile.seek()`. – greybeard Sep 02 '22 at 05:38
4

I don't think you need to sort the data to eliminate duplicates. Just use a quicksort-inspired approach, as sketched below:

  1. Pick k pivots from the data (unless your data is really wacky, this should be pretty straightforward).
  2. Using these k pivots, divide the data into k+1 small files.
  3. If any of these chunks are too large to fit in memory, repeat the process just for that chunk.
  4. Once you have manageable-sized chunks, just apply your favorite method (hashing?) to find duplicates.

Note that k can be equal to 1.
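
For illustration only (not the answer's code), here is a rough Java sketch of the scatter-then-check idea, with hard-coded placeholder pivots and without the recursive step 3; the binary search over the sorted pivot array is the variant Peter Cordes describes in the comment below:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PivotPartition {
    public static void main(String[] args) throws IOException {
        Path input = Paths.get(args[0]);
        String[] pivots = {"g", "n", "t"};          // k = 3 pivots sampled from the data (placeholders)

        // One bucket file per gap between pivots (k + 1 buckets in total).
        List<Path> buckets = new ArrayList<>();
        List<BufferedWriter> writers = new ArrayList<>();
        for (int i = 0; i <= pivots.length; i++) {
            Path bucket = Files.createTempFile("bucket" + i + "-", ".txt");
            buckets.add(bucket);
            writers.add(Files.newBufferedWriter(bucket, StandardCharsets.UTF_8));
        }

        // Scatter: binary-search the sorted pivot array to pick a bucket for each line.
        // Equal keys always land in the same bucket, so no duplicate can be missed.
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                int pos = Arrays.binarySearch(pivots, line);
                int bucket = pos >= 0 ? pos : -pos - 1;
                writers.get(bucket).write(line);
                writers.get(bucket).newLine();
            }
        }
        for (BufferedWriter writer : writers) {
            writer.close();
        }

        // Each bucket should now fit in memory, so a plain HashSet finds the duplicates.
        for (Path bucket : buckets) {
            Set<String> seen = new HashSet<>();
            Set<String> duplicates = new HashSet<>();
            try (BufferedReader in = Files.newBufferedReader(bucket, StandardCharsets.UTF_8)) {
                String line;
                while ((line = in.readLine()) != null) {
                    if (!seen.add(line)) {
                        duplicates.add(line);
                    }
                }
            }
            duplicates.forEach(System.out::println);
        }
    }
}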

ElKamina
  • 7,747
  • 28
  • 43
  • So steps 1 & 2 are really: pick k elements and sort them. Read through the file: for each line, binary search your pivot array, and write the line into bucket `i`, where `pivot[i-1] < line < pivot[i]`. It'd be a lot easier to use the first character or two as a radix to scatter the input into buckets, instead of searching a list of pivots, if your strings have a fairly uniform distribution of first character. – Peter Cordes Sep 14 '15 at 16:12
3

One way I can imagine solving this is to first use an external sorting algorithm to sort the file (searching for "external sort java" yields lots of results with code). Then you can iterate over the file line by line; duplicates will now obviously directly follow each other, so you only need to remember the previous line while iterating.
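
A minimal sketch of that scan, assuming the external sort has already written its output to a file (the name sorted.txt and the choice to print each duplicated key once are placeholders):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SortedScan {
    public static void main(String[] args) throws IOException {
        // "sorted.txt" is assumed to be the output of the external sort step.
        try (BufferedReader in = Files.newBufferedReader(Paths.get("sorted.txt"), StandardCharsets.UTF_8)) {
            String previous = null;
            boolean reported = false;
            String line;
            while ((line = in.readLine()) != null) {
                if (line.equals(previous)) {
                    if (!reported) {
                        System.out.println(line);   // duplicate key, reported once
                        reported = true;
                    }
                } else {
                    previous = line;
                    reported = false;
                }
            }
        }
    }
}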

DarkDust
  • 90,870
  • 19
  • 190
  • 224
2

If you cannot build up a complete list because you don't have enough memory, you might try doing it in multiple passes. I.e., create a hashmap but only store a small portion of the items (for example, those starting with 'A'). Then you gather the duplicates, then continue with 'B', etc.

Of course you can select any kind of 'grouping' (e.g. the first 3 characters, the first 6, etc.).

It will only take (many) more iterations.
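
A minimal sketch of such a pass-per-group scheme, grouping on the first character and assuming for simplicity that every key starts with a letter; real data would need a grouping that covers every possible first character:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class GroupedPasses {
    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]);
        // One pass per leading letter; only keys in the current group are held in memory.
        for (char group = 'A'; group <= 'Z'; group++) {
            Set<String> seen = new HashSet<>();
            Set<String> duplicates = new HashSet<>();
            try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                String line;
                while ((line = in.readLine()) != null) {
                    if (!line.isEmpty() && Character.toUpperCase(line.charAt(0)) == group) {
                        if (!seen.add(line)) {
                            duplicates.add(line);
                        }
                    }
                }
            }
            duplicates.forEach(System.out::println);
        }
    }
}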

Michel Keijzers
  • 15,025
  • 28
  • 93
  • 119
2

You might try a Bloom filter, if you're willing to accept a certain amount of statistical error. Guava provides one, but there's a pretty major bug in it right now that should probably be fixed next week with release 11.0.2.
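
A sketch of how this might look with Guava's `BloomFilter`. It assumes a newer Guava release than the one mentioned in the answer (one where `Funnels.stringFunnel(Charset)` exists), picks an arbitrary 1% false-positive rate, and uses a second exact pass to weed out the false positives, as the comment below suggests:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class BloomDuplicates {
    public static void main(String[] args) throws IOException {
        // ~15 million expected keys at a 1% false-positive rate: on the order of 20 MB of bits.
        BloomFilter<CharSequence> seen = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), 15_000_000, 0.01);

        // Pass 1: anything the filter claims to have seen already is a *candidate* duplicate.
        Set<String> candidates = new HashSet<>();
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (seen.mightContain(line)) {
                    candidates.add(line);           // may include false positives
                }
                seen.put(line);
            }
        }

        // Pass 2: recount only the candidates to eliminate the false positives.
        Set<String> seenOnce = new HashSet<>();
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (candidates.contains(line) && !seenOnce.add(line)) {
                    System.out.println(line);       // confirmed duplicate
                    candidates.remove(line);        // report each key once
                }
            }
        }
    }
}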

Louis Wasserman
  • 191,574
  • 25
  • 345
  • 413
  • That would be my answer too. The false positives could be eliminated in a second phase (the size of the candidate list would be much smaller) – Victor P. Nov 02 '15 at 21:01
0
#!/bin/bash

# This script will sort a file and remove duplicates
# It will use external merge sort to do this
# It will use a temporary directory to store the sorted chunks
# It will use a temporary directory to store the merged chunks
# It will use a temporary directory to store the final sorted file

# The script will take the following parameters
# $1 - The file to sort
# $2 - The number of lines to sort in each chunk
# $3 - The number of chunks to merge at a time
# $4 - The temporary directory to use

# The script will output the sorted file to stdout

# The script will return 0 on success
# The script will return 1 if the file does not exist
# The script will return 2 if the temporary directory does not exist

# Check that the file exists
if [ ! -f "$1" ]; then
    echo "The file $1 does not exist"
    exit 1
fi

# Check that the temporary directory exists
if [ ! -d "$4" ]; then
    echo "The temporary directory $4 does not exist"
    exit 2
fi

# Create a temporary directory to store the sorted chunks
chunk_dir="$4/chunks"
mkdir -p "$chunk_dir"

# Create a temporary directory to store the merged chunks
merge_dir="$4/merge"
mkdir -p "$merge_dir"

# Create a temporary directory to store the final sorted file
sort_dir="$4/sort"
mkdir -p "$sort_dir"

# Split the file into chunks
split -l "$2" "$1" "$chunk_dir/chunk"

# Sort each chunk in place, dropping duplicates within the chunk
for chunk in "$chunk_dir"/chunk*; do
    sort -u "$chunk" -o "$chunk"
done

# Merge the sorted chunks, $3 at a time, until a single file remains
while [ "$(ls "$chunk_dir" | wc -l)" -gt 1 ]; do
    # Collect up to $3 chunks for this round
    merge_chunks=()
    for chunk in $(ls "$chunk_dir" | head -n "$3"); do
        merge_chunks+=("$chunk_dir/$chunk")
    done
    merge_file="$merge_dir/merge$(date +%s%N)"
    # Merge the already-sorted chunks and drop duplicates that span chunks
    sort -m "${merge_chunks[@]}" | uniq > "$merge_file"
    rm "${merge_chunks[@]}"
    # Put the merged file back into the chunk directory for the next round
    mv "$merge_file" "$chunk_dir"
done

# Move the final sorted file to the sort directory
mv "$chunk_dir"/* "$sort_dir"

# Output the sorted file to stdout
cat "$sort_dir"/*

# Remove the temporary directories
rm -rf "$chunk_dir"
rm -rf "$merge_dir"
rm -rf "$sort_dir"
Mike
  • 121
  • 1
  • 1
  • 9
  • What question does this answer, and how? Does it improve on [Michael's response](https://stackoverflow.com/a/9217211/3789665), which does nothing to answer *How do I avoid `java.lang.OutOfMemoryError`*, too? – greybeard Sep 02 '22 at 05:24
0

The key is that your data will not fit into memory. (BrokenGlass)
With enough memory to store a Map from key hash values to something that locates the key (like an offset for `RandomAccessFile.seek()`, or a line number as Andrew White suggested), you can process non-unique keys as they are identified.

Otherwise, establish a map from hash values to "maybe seen before" (e.g. a 3 MB bitmap indexed with `key.hashCode() % (3<<23)`) in a first pass, and in a second pass handle only the keys from buckets that were hit at least twice.
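
A hedged sketch of that two-pass idea, using two `java.util.BitSet`s of roughly 3 MB each (one for "seen before", one for "bucket hit at least twice") rather than a single bitmap. Note that with 15 million keys hashed into about 25 million buckets, a fair share of buckets collide by chance, so the second-pass map is much smaller than the full key set but not tiny:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class BitmapTwoPass {
    // 3 << 23 = ~25 million one-bit buckets, i.e. about 3 MB per BitSet.
    private static final int BUCKETS = 3 << 23;

    private static int bucket(String key) {
        return Math.floorMod(key.hashCode(), BUCKETS);
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]);

        // Pass 1: mark which hash buckets were hit once and which at least twice.
        BitSet seen = new BitSet(BUCKETS);
        BitSet hitTwice = new BitSet(BUCKETS);
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                int b = bucket(line);
                if (seen.get(b)) {
                    hitTwice.set(b);
                } else {
                    seen.set(b);
                }
            }
        }

        // Pass 2: only keys in buckets hit at least twice can be duplicates;
        // compare those exactly in a (much smaller) in-memory map.
        Map<String, Integer> counts = new HashMap<>();
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (hitTwice.get(bucket(line)) && counts.merge(line, 1, Integer::sum) == 2) {
                    System.out.println(line);   // duplicate key, reported once
                }
            }
        }
    }
}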

greybeard
  • 2,249
  • 8
  • 30
  • 66