A similar question was asked here, but it didn't address why there is a speed difference between sort and awk.
I first asked this question on Unix Stackexchange, but since I was told it would be a better fit for Stackoverflow I'm posting it here.
I need to deduplicate a large wordlist. I tried several commands and did some research here and here, where it was explained that the fastest way to deduplicate a wordlist seems to be awk, because awk doesn't sort the list: it uses hash lookups to keep track of items and drop duplicates. Since awk uses hash lookups, the argument was that the big O complexity is roughly:
awk --> O(n) ?
sort --> O(n log n) ?
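For reference, here is an expanded, commented version of the awk one-liner used in the tests below; the array x is an awk associative array (a hash table), which is where the O(n) reasoning comes from:

awk '
    # x is an associative array (hash table); x[$0] counts how many
    # times the current line has been seen so far.
    # Print the line only if the count was 0 before incrementing,
    # i.e. on its first occurrence. Each line costs roughly one
    # constant-time hash lookup on average, hence the O(n) estimate.
    { if (x[$0]++ == 0) print }
' input.txt > output.txt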
However, I found that this isn't true in practice. Here are my test results. I generated two random wordlists using this Python script; a rough shell sketch of a comparable generator is shown after the sizes below.
List1 = 7 MB
List2 = 690 MB
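(This is not the original generator script, just a rough shell sketch of how random wordlists of a similar shape could be produced. The word length and line counts are made-up parameters; shorter words give more duplicates.)

# Illustrative only: random lowercase "words", 8 characters per line.
# Adjust the line counts to reach roughly the sizes above.
tr -dc 'a-z' < /dev/urandom | fold -w 8 | head -n 800000 > list1.txt
tr -dc 'a-z' < /dev/urandom | fold -w 8 | head -n 80000000 > list2.txt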
Test commands
sort -u input.txt -o output.txt
awk '!x[$0]++' input.txt > output.txt
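The real/user/sys figures below come from wrapping each command in the shell's time builtin, roughly like this; the diff line is just an optional sanity check (it assumes output order doesn't matter and uses bash process substitution):

time sort -u input.txt -o output_sort.txt
time awk '!x[$0]++' input.txt > output_awk.txt

# Optional: verify both approaches keep the same set of unique lines.
# output_sort.txt is already sorted, so sort the awk output and compare.
diff <(sort output_awk.txt) output_sort.txt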
Results AWK:
List1
real 0m1.643s
user 0m1.565s
sys 0m0.062s
List2
real 2m6.918s
user 2m4.499s
sys 0m1.345s
Results SORT:
List1
real 0m0.724s
user 0m0.666s
sys 0m0.048s
List2
real 1m27.254s
user 1m25.013s
sys 0m1.251s
I ran these tests over and over again and the results were consistent: sort is a lot faster. Could someone explain why, and whether there is an even faster way to do it?
************ Update ***********
Things that could have skewed my results are:
- Caching: I've excluded this possibility by changing the order of execution of the tests (see the sketch after this list).
- Constant factors hidden by the big O notation: I think they should have become irrelevant at this point given the size of the wordlists (the larger one is 690 MB).
- Bad implementation of the algorithms: this remains a possibility; I haven't checked the source code of awk or sort.
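For completeness, this is roughly what I mean by changing the execution order between repeated runs; the drop_caches line is optional, Linux-only, needs root, and the file names are placeholders:

for i in 1 2 3; do
    # Optionally clear the page cache so neither tool benefits from the
    # input file already being in memory (Linux only, needs root):
    # sync; echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null

    # Alternate which command runs first on each iteration.
    if [ $((i % 2)) -eq 0 ]; then
        time sort -u input.txt -o out_sort.txt
        time awk '!x[$0]++' input.txt > out_awk.txt
    else
        time awk '!x[$0]++' input.txt > out_awk.txt
        time sort -u input.txt -o out_sort.txt
    fi
done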