I have two files with SHA1 sums and I'm trying to find the matching rows in them. I tried using grep:

grep -f first.txt second.txt

but that was pretty slow. It got me thinking: what is the fastest way to find the matching rows in Bash, using scripting or any of the usual shell tools?
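
One variant I would include in any comparison (a sketch, not timed here; it assumes exact whole-line matches are enough, which should hold for fixed-length hex hashes):

# -F treats each pattern in first.txt as a literal string (no regex engine),
# -x requires the whole line to match rather than just a substring.
grep -Fxf first.txt second.txt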

Below is a script that produces two files, each containing the SHA1 sums of the values 1...10000, and shuffles the rows (with shuf) as they are written. The rows in the two files are therefore the same, just in different orders. It took my shared shell server about 40 seconds to create the two files.

for file in first.txt second.txt
do
    for i in {1..10000}
    do
        # sha1sum prints "<hash>  -"; keep only the hash part
        dashed=$(echo "$i" | sha1sum)
        read -r undashed rest <<< "$dashed"
        echo "$undashed"
    done | shuf > "$file"
done
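
If the only requirement is that both files contain the same shuffled set of hashes, here is a sketch that halves the number of sha1sum calls by generating the list once and shuffling it twice (hashes.txt is just a scratch file name I made up):

for i in {1..10000}
do
    # sha1sum prints "<hash>  -"; cut keeps only the first field
    echo "$i" | sha1sum
done | cut -d' ' -f1 > hashes.txt

shuf hashes.txt > first.txt
shuf hashes.txt > second.txt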

time grep -f first.txt second.txt

took about a minute to find 12 matching rows, i.e. roughly one match every five seconds. Sorting the files before grepping didn't speed it up noticeably. Somewhere I saw a suggestion to use grep --mmap, but that only produced the following:

grep: the --mmap option has been a no-op since 2010
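
Another candidate for the test list (a sketch, untested on these files): load the first file into an awk array and stream the second file against it, so each lookup is a hash-table hit instead of a pattern scan.

# While reading first.txt (NR == FNR), record each line as an array key;
# for second.txt, print the lines that exist as keys (exact whole-line matches).
awk 'NR == FNR { seen[$0]; next } $0 in seen' first.txt second.txt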

So, who is up for some testing?

Feel free to fix the script if you'd like and add tags as you come up with ideas. Is 10000 rows enough for testing?

James Brown
  • what about using `diff`? Also have a look at [Remove duplicates from text file based on second text file](http://stackoverflow.com/q/30820894/1983854) with a good set of comparisons. – fedorqui Nov 06 '15 at 11:07

1 Answer


First sort the files, then use join:

sort first.txt > firstSorted.txt
sort second.txt > secondSorted.txt
join firstSorted.txt secondSorted.txt
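
If you prefer to skip the temporary files, the same idea works with process substitution (assuming bash or another shell that supports it):

# Sort both inputs on the fly and join them; join requires sorted input.
join <(sort first.txt) <(sort second.txt)

comm -12 on the two sorted files would likewise print only the lines common to both.
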
twin