I have two files with SHA1 sums and I'm trying to find the matching rows in them. I tried using grep:

grep -f first.txt second.txt

but that was pretty slow. It got me thinking: what is the fastest way to find the matching rows in Bash, using scripting or any of the usual shell tools?
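
One variant I would include in any comparison (a sketch, not timed here; it assumes exact whole-line matches are enough, which should hold for fixed-length hex hashes):

# -F treats each pattern in first.txt as a literal string (no regex engine),
# -x requires the whole line to match rather than just a substring.
grep -Fxf first.txt second.txt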

Below is a script that produces two files, each containing the SHA1 sums of the values 1...10000, and shuffles the rows (with shuf) as they are written. The rows in the two files are therefore the same, just in different orders. It took my shared shell server about 40 seconds to create the two files.

for file in first.txt second.txt
do
    for i in {1..10000}
    do
        # sha1sum prints "<hash>  -"; keep only the hash part
        dashed=$(echo "$i" | sha1sum)
        read -r undashed rest <<< "$dashed"
        echo "$undashed"
    done | shuf > "$file"
done
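
If the only requirement is that both files contain the same shuffled set of hashes, here is a sketch that halves the number of sha1sum calls by generating the list once and shuffling it twice (hashes.txt is just a scratch file name I made up):

for i in {1..10000}
do
    # sha1sum prints "<hash>  -"; cut keeps only the first field
    echo "$i" | sha1sum
done | cut -d' ' -f1 > hashes.txt

shuf hashes.txt > first.txt
shuf hashes.txt > second.txt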

time grep -f first.txt second.txt

took about a minute to find 12 matching rows, i.e. roughly one match every five seconds. Sorting the files before grepping didn't speed it up noticeably. Somewhere I saw a suggestion to use grep --mmap, but that only produced the following:

grep: the --mmap option has been a no-op since 2010
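
Another candidate for the test list (a sketch, untested on these files): load the first file into an awk array and stream the second file against it, so each lookup is a hash-table hit instead of a pattern scan.

# While reading first.txt (NR == FNR), record each line as an array key;
# for second.txt, print the lines that exist as keys (exact whole-line matches).
awk 'NR == FNR { seen[$0]; next } $0 in seen' first.txt second.txt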

So, who is up for some testing?

Feel free to fix the script if you'd like and add tags as you come up with ideas. Is 10000 rows enough for testing?

James Brown
  • what about using `diff`? Also have a look at [Remove duplicates from text file based on second text file](http://stackoverflow.com/q/30820894/1983854) with a good set of comparisons. – fedorqui Nov 06 '15 at 11:07

1 Answer


First sort the files, then use join:

sort first.txt > firstSorted.txt
sort second.txt > secondSorted.txt
join firstSorted.txt secondSorted.txt
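
If you prefer to skip the temporary files, the same idea works with process substitution (assuming bash or another shell that supports it):

# Sort both inputs on the fly and join them; join requires sorted input.
join <(sort first.txt) <(sort second.txt)

comm -12 on the two sorted files would likewise print only the lines common to both.
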
twin