
What is the fastest method for finding lines in a file that contain a string from a given list? I have a file containing the strings to search for. This small file (smallF) contains about 50,000 lines and looks like:

stringToSearch1
stringToSearch2
stringToSearch3

I have to search for all of these strings in a larger file (about 100 million lines). If any line in this larger file contains a search string, that line is printed.

The best method I have come up with so far is

grep -F -f smallF largeF

But this is not very fast. With just 100 search strings in smallF it takes about 4 minutes; with over 50,000 search strings it will take a very long time.

Is there a more efficient method?

user262540
  • No. This is the most efficient method, unless you can parallelize the search, or write a special-purpose program. – Michael Vehrs Jun 08 '16 at 05:23
  • See: [Fastest possible grep](http://stackoverflow.com/q/9066609/3776858) – Cyrus Jun 08 '16 at 05:37
  • `fgrep` instead of `grep -F` should be a little faster... – Jahid Jun 08 '16 at 05:40
  • @Jahid: from GNU grep's manpage: "_fgrep is the same as grep -F._" – Cyrus Jun 08 '16 at 05:52
  • http://lucene.apache.org/solr/ – xxfelixxx Jun 08 '16 at 06:16
  • @Cyrus: yes, but `fgrep` is a direct invocation, while `-F` is a switch. For multiple runs, it makes a slight difference. – Jahid Jun 08 '16 at 06:49
  • 3
    @Jahid To clarify this is the content of `/usr/bin/fgrep` `#!/bin/sh exec grep -F "$@"` – Andreas Louv Jun 08 '16 at 06:59
  • @andlrc what's even the point then, why not just make it an alias? – 123 Jun 08 '16 at 07:18
  • @andlrc: In GNU grep, the `fgrep` invocation is actually deprecated; they are trying to remove `fgrep` – Jahid Jun 08 '16 at 07:29
  • If your data is plain ASCII (non-UTF), I read somewhere that it can work faster if you set a `C` locale with `export LC_ALL=C` before grepping - may be worth a try. – Mark Setchell Jun 08 '16 at 08:06
  • @123 I guess it's to still have an executable called fgrep while avoiding maintaining two sources. – Andreas Louv Jun 08 '16 at 08:51
  • You may be able to optimise your search strings... by ordering them so more frequent ones come first, or by finding common parts (e.g. the word "error" may appear in 23,000 of your strings), so you could search first for "error" and then for specifics once you have found one. – Mark Setchell Jun 08 '16 at 09:49
  • You may be able to use **GNU Parallel**... http://www.rankfocus.com/use-cpu-cores-linux-commands/ – Mark Setchell Jun 08 '16 at 09:51
  • @MarkSetchell I tried the C locale, but it did not help. Also, all my search strings are unique – user262540 Jun 08 '16 at 17:56
  • I don't want to change the original question, but in a related problem I have to match the 3rd column of the large file against each search string. If there is a match, the whole line is printed. What is the best solution in this case? – user262540 Jun 08 '16 at 17:59
  • See this post: [Fastest way to find lines of a file from another larger file in Bash](https://stackoverflow.com/questions/42239179/fastest-way-to-find-lines-of-a-file-from-another-larger-file-in-bash) – codeforester Mar 03 '18 at 18:38
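Picking up the suggestions from the comments above: the locale trick is a one-line change, GNU Parallel can split the large file across cores, and the follow-up question about matching the 3rd column can be handled with a hash lookup in awk rather than grep. The following is a sketch, assuming plain-ASCII data, whitespace-separated columns, and exact matches on field 3:

    # byte-wise matching for plain-ASCII data (Mark Setchell's locale tip)
    LC_ALL=C grep -F -f smallF largeF

    # split largeF into chunks and grep them in parallel (GNU Parallel)
    parallel --pipepart --block 10M -a largeF grep -F -f smallF

    # load smallF into a hash, then print largeF lines whose 3rd field is a key
    awk 'NR==FNR {pat[$0]; next} $3 in pat' smallF largeF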

3 Answers


I once noticed that using -E or multiple -e parameters is faster than using -f. Note that this might not be applicable to your problem, since you are searching for 50,000 strings in a larger file. However, I wanted to show what can be done and what might be worth testing:

Here is what I noticed in detail:

I have a 1.2 GB file filled with random strings:

>ls -has | grep string
1,2G strings.txt

>head strings.txt
Mfzd0sf7RA664UVrBHK44cSQpLRKT6J0
Uk218A8GKRdAVOZLIykVc0b2RH1ayfAy
BmuCCPJaQGhFTIutGpVG86tlanW8c9Pa
etrulbGONKT3pact1SHg2ipcCr7TZ9jc
.....

Now I want to search for strings "ab", "cd" and "ef" using different grep approaches:

  1. Using grep without flags, searching one string at a time:
    grep "ab" strings.txt > m1.out  
    2,76s user 0,42s system 96% cpu 3,313 total
    
    grep "cd" strings.txt >> m1.out  
    2,82s user 0,36s system 95% cpu 3,322 total
    
    grep "ef" strings.txt >> m1.out  
    2,78s user 0,36s system 94% cpu 3,360 total

So in total the search takes nearly 10 seconds.

  2. Using grep with the -f flag, with the search strings in search.txt:

     >cat search.txt
      ab
      cd
      ef
    
     >grep -F -f search.txt strings.txt > m2.out  
     31,55s user 0,60s system 99% cpu 32,343 total
    

For some reason this takes nearly 32 seconds.

  3. Now using multiple search patterns with -E or -e:

     grep -E "ab|cd|ef" strings.txt > m3.out  
     3,80s user 0,36s system 98% cpu 4,220 total
    

    or

     grep --color=auto -e "ab" -e "cd" -e "ef" strings.txt > /dev/null  
     3,86s user 0,38s system 98% cpu 4,323 total
    

The third method, using -E, took only 4.22 seconds to search through the file.

Now let's check whether the results are the same:

cat m1.out | sort | uniq > m1.sort  
cat m3.out | sort | uniq > m3.sort
diff m1.sort m3.sort
#

The diff produces no output, which means the found results are the same.

You may want to give it a try; otherwise I would advise you to look at the thread "Fastest possible grep" linked in Cyrus's comment.
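If you want to test the -E approach with the patterns stored in a file, as in the question, you could build the alternation on the fly. A sketch, assuming the patterns contain no regex metacharacters; note that with 50,000 patterns the expanded command line may hit the shell's argument-length limit, and such a huge alternation may not be faster at all, so benchmark it first:

    # join the lines of smallF with "|" into a single alternation pattern
    grep -E "$(paste -sd'|' smallF)" largeF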

cb0

You may want to try sift or ag (The Silver Searcher). sift in particular lists some pretty impressive benchmarks versus grep.
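Basic usage would look something like this (a sketch: ag's -F/--fixed-strings flag treats the pattern as a literal, but flags differ between tools and versions, so check their documentation):

    # ag (The Silver Searcher): literal search in the large file
    ag -F 'stringToSearch1' largeF

    # sift: pattern followed by the file to search
    sift 'stringToSearch1' largeF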

ajfabbri

Note: I realise the following is not a bash-based solution but, given your large search space, a parallel solution is warranted.

If your machine has more than one core/processor, you could compile and call the following function with Pythran to parallelize the search:

#!/usr/bin/env python

#pythran export search_in_file(str, str)
def search_in_file(long_file_path, short_file_path):
    # Read both files up front: a plain file handle would be exhausted
    # after the first pattern, and each pattern must scan every line.
    long_lines = open(long_file_path, "r").readlines()
    patterns = [p.strip() for p in open(short_file_path, "r")]

    #omp parallel for schedule(guided)
    for i in range(len(patterns)):
        for line in long_lines:
            if patterns[i] in line:
                print(line, end="")

if __name__ == "__main__":
    search_in_file("long_file_path", "short_file_path")

Note: Behind the scenes, Pythran takes Python code and attempts to aggressively compile it into very fast C++.
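Compiling and using it would look roughly like this (a sketch; it assumes Pythran is installed and the function is saved as search_in_file.py, a file name chosen here for illustration; -fopenmp enables the #omp annotation):

    # compile the annotated Python into a native extension module
    pythran -fopenmp search_in_file.py

The resulting module can then be imported from regular Python and its search_in_file() called as usual.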

boardrider