4

A similar question was asked here, but it didn't address why there is a speed difference between sort and awk.

I first asked this question on Unix Stack Exchange, but since I was told it would be a good question for Stack Overflow, I'll post it here.

I need to deduplicate a large wordlist. I tried several commands and did some research here and here, where it was explained that the fastest way to deduplicate a wordlist seems to be using awk, because awk doesn't sort the list: it uses hash lookups to keep track of the items and delete duplicates. Since awk uses hash lookups, they argued that the big O is like this:

awk --> O(n) ?
sort --> O(n log n) ?
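
As I understand it, the awk one-liner I'm testing below is just shorthand for something like the following (a sketch of the hash-lookup idea, same behaviour):

# x is an associative (hash) array: print a line only the first time it is seen
awk '{ if (!($0 in x)) { print; x[$0] = 1 } }' input.txt > output.txt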

However, I found that this isn't true. Here are my test results. I generated two random wordlists using this Python script.

List1 = 7 MB
List2 = 690 MB

Test commands

sort -u input.txt -o output.txt 

awk '!x[$0]++' input.txt > output.txt
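
The timings below come from wrapping each command in the shell's time builtin, roughly like this:

time sort -u input.txt -o output.txt
time awk '!x[$0]++' input.txt > output.txt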

Results AWK:
List1
real 0m1.643s
user 0m1.565s
sys 0m0.062s

List2
real 2m6.918s
user 2m4.499s
sys 0m1.345s

Results SORT:
List1
real 0m0.724s
user 0m0.666s
sys 0m0.048s

List2
real 1m27.254s
user 1m25.013s
sys 0m1.251s

I ran these tests over and over again and got consistent results, namely that sort is a lot faster. Could someone explain why, and whether there is an even faster way to do it?

************ Update ***********
Things that could have skewed my results are:

  1. Caching: I've excluded this possibility by changing the order of execution of the tests (see the sketch after this list).
  2. Constant factors of the big O notation: I think they should have become irrelevant at this point due to the size of the wordlists (690 MB for the larger one).
  3. Bad implementation of the algorithms: this remains a possibility; I haven't checked the source code of awk and sort.
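
For point 1, a test session looks roughly like the sketch below (not my exact commands; List2.txt stands for the larger wordlist). A later session swaps the two timed commands, and the timings stay consistent either way.

# session A: sort first, then awk
time sort -u List2.txt -o output.txt
time awk '!x[$0]++' List2.txt > output.txt

# session B (separate run): awk first, then sort
time awk '!x[$0]++' List2.txt > output.txt
time sort -u List2.txt -o output.txt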
Karl
  • What is `n` in each case? If you try this for many different values of `n`, do you see `awk`'s running time increasing linearly, or `sort`'s increasing as `n lg n`? – chepner Sep 01 '15 at 17:10
  • There may be something special about your lists. For me, `awk` wins as I expect. Try `shuf /usr/share/dict/words > unsorted; time sort -u unsorted -o sort-sorted; time awk '!x[$0]++' unsorted > awk-sorted` – bishop Sep 01 '15 at 17:16
  • Add your awk code to your question. – Cyrus Sep 01 '15 at 17:18
  • @bishop I've tried with 15+ GB wordlists, with the same result. The lists I used for this test are randomly generated with Python. – Karl Sep 01 '15 at 17:31
  • @chepner both seem to behave pretty much linearly, comparing the two wordlists – Karl Sep 01 '15 at 17:34
  • Your test data has a *lot* of duplicates (expected 99%), which `sort -u` can process much more efficiently than the worst-case scenario. Testing with 1,000,000 random values from 500,000 distinct values (instead of 100,000,000 random from 1,000,000 distinct) has `awk` being almost 14 times faster. – chepner Sep 01 '15 at 17:59
  • This has been asked so many times. Some additional strategies here: http://stackoverflow.com/q/30906191/1172700 – ghoti Sep 01 '15 at 18:10
  • @chepner Yep, I suspect that's what's going on here. That said, karlpy, `awk` will be memory constrained, as the entire set must be kept *in situ*. If free() < sizeof(wordlist), use something other than awk (like a Bloom filter). – bishop Sep 01 '15 at 18:41
  • Interesting point @chepner, I've replicated your experiment with 10.000.000 sample size and 500.000 distinct values. But sort is still 30% faster --> sort = 0m10.258s -- awk = 0m16.966s. Could this be related to the OS and hardware? I'm on a MacBook Pro 2015, 2.7 GHz Intel Core i5 – Karl Sep 01 '15 at 19:02

2 Answers

3

Your sample input has a lot of duplicate values; you only have 1,000,000 distinct values in a sample size of 100,000,000, so you would expect only 1% of the values to be unique. I don't know exactly how sort -u works, but I imagine it is a merge sort that drops duplicate values during each merge. The effective input size would then be much smaller than 100,000,000. Rerunning your commands with only 1,000,000 values, but chosen from 500,000 distinct values (so that 50%, not 1%, are expected to be unique) produces the following results:

% time awk '!x[$0]++' randomwordlist.txt > /dev/null
awk ...  1.32s user 0.02s system 99% cpu 1.338 total
% time sort -u randomwordlist.txt -o /dev/null
sort ...  14.25s user 0.04s system 99% cpu 14.304 total
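
(For reference, a test file like that can be generated with GNU shuf, e.g. as below; this is not necessarily how the OP's Python script works.)

# 1,000,000 lines drawn with repetition from 500,000 distinct values
shuf -r -n 1000000 -i 1-500000 > randomwordlist.txt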
chepner
  • Interesting point, I've replicated your experiment with a 10.000.000 sample size and 500.000 distinct values. But sort is still 30% faster --> sort = 0m10.258s -- awk = 0m16.966s – Karl Sep 01 '15 at 18:58
  • Could this be related to the OS and hardware? I'm on a MacBook Pro 2015, 2.7 GHz Intel Core i5 – Karl Sep 01 '15 at 19:01
  • sorry I mean 5.000.000 – Karl Sep 01 '15 at 19:03
  • You're probably hitting memory limits that produce swap-related overhead. – chepner Sep 01 '15 at 19:03
  • Also I've tried with another large wordlist with 50% unique values, with the same results there. So this seems like an implementation issue to me. – Karl Sep 01 '15 at 19:04
  • Just to clarify, I replicated your test with 1M-500.000 words and also tried with 10M-5M words. In both, sort is faster, so I don't think it is swap-related – Karl Sep 01 '15 at 19:21
1
  1. The big-O notation only tells you that there is some N above which O(N) will be faster than O(N*log N). The actual number of operations includes constant factors and added terms, so that in reality the numbers are
    O(N) ~ k1 * N + c1 and
    O(N * log N) ~ k2 * N * log(N) + c2
    Which one is faster for a chosen N depends on the values of k and c (see the numeric sketch after this list).
  2. Some input/algorithm combinations lead to very small k and c.
  3. Either program may not use the optimum algorithm.
  4. Caching effects? If you always run test 1 before test 2, the second test may use already cached data, while the first always has to load from scratch. Proper elimination/determination of cache effects is an art.
  5. Something else I haven't thought of and others will be quick to point out :-)
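
To put point 1 in numbers (a toy calculation, not a claim about the actual constants in sort or awk): with N = 100,000,000 input lines, log2(N) is only about 26.6, so an O(N*log N) implementation whose per-item constant k2 is more than about 27 times smaller than the O(N) implementation's k1 still comes out ahead.

# log2(N) for N = 1e8 -- the entire "penalty" of the N*log N algorithm
awk 'BEGIN { N = 1e8; print log(N)/log(2) }'   # prints 26.5754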
Jens
  • I thought that with the big difference in size of the wordlists, k and c should become irrelevant (I did this with a 20 GB wordlist too). I've excluded the caching effect by changing the order of execution in this experiment – Karl Sep 01 '15 at 17:36