I am using grep -v to extract the strings in a target file that are not present in a desired file. However, the process took too long (more than 12 hours) and was eventually killed by the machine before it finished.

The command I used:

grep -v -f desire.txt target.txt >> no_in_desire_file.txt

The desire.txt has 45502 strings; target.txt has 268101 strings.

Could someone share their experience of speeding up the grep process? Unfortunately, I am not good at Python or Perl.

UPDATED:

The suggestion by @John1024 improved the speed of the grep process.

If it contains just plain strings, then add the -F option for fixed strings. This greatly speeds grep. – John1024

Jens
KJ Lim
  • How large in bytes are desire.txt, target.txt and no_in_desire_file.txt? – Cyrus Nov 06 '14 at 07:15
  • In many cases, import of a text file into a db of choice (such as postgresql) and designing an appropriate query is an only reasonable way to proceed. – oakad Nov 06 '14 at 07:35
  • http://stackoverflow.com/questions/18204904/fast-way-of-finding-lines-in-one-file-that-are-not-in-another - duplicate, as always. :) – oakad Nov 06 '14 at 07:40
  • Does `desire.txt` contain strings or regular expressions? If it contains just plain strings, then add the `-F` option for fixed strings. This greatly speeds `grep`. – John1024 Nov 06 '14 at 07:58
  • @John1024, both text files contain strings like: comp100014_c0 comp0_c0_seq1. The strings differ in length. Does -F work in this case? The desire.txt has 45502 strings, target.txt has 268101 strings – KJ Lim Nov 06 '14 at 08:05
  • @oakad, thanks for pointing that out. I did search around before posting this question; perhaps I missed that thread. I read that post and tried the diff solution, but it did not work. – KJ Lim Nov 06 '14 at 08:07
  • @KJLim Yes, `-F` would work for those strings. – John1024 Nov 06 '14 at 08:11
  • @John1024 Thanks for the suggestion. grep -F works, but the number of output lines is a bit strange. The desire.txt has 45502 strings, the target.txt has 268101 strings. The output has 217315 lines in total. Hmm... – KJ Lim Nov 06 '14 at 08:36
  • @KJLim From that, I would conclude that you had some strings in `desire.txt` that matched more than one line in `target.txt`. – John1024 Nov 06 '14 at 08:42

1 Answer

If the strings that you are matching are not regular expressions, then a large speed-up is possible by specifying grep's -F option.

grep is capable of processing patterns in the form of very complex and powerful regular expressions. Consider, for example:

$ echo mississippi | grep -E 'm(.*is)+.*i'
mississippi

In this case, grep looks for the letter m followed by one or more occurrences of a group consisting of any number of characters followed by is, all followed by any number of characters and then an i. Computing such matches can be quite complicated.

In your case, however, your patterns are simple strings like:

comp100014_c0
comp0_c0_seq1

For these strings, we are looking for simple matches. This requires no fancy computation. To speed up grep, we can tell it that our strings are all simple. We do this by specifying the -F option. In man grep, this feature is documented as:

-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched. (-F is specified by POSIX.)
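
As a rough illustration, the sketch below runs the command from the question with -F added, on two tiny stand-in files. The file contents and the scratch directory are made up for the example; only the file names and the sample strings come from the question. It also shows one possible reason for a smaller-than-expected line count: with -F alone, a pattern still matches anywhere in a line, so comp0_c0 also removes comp0_c0_seq1. If whole-line matches are what you want (an assumption about your data), the POSIX -x option restricts matching to entire lines.

```shell
# Hypothetical miniature versions of the two files, in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' comp0_c0 comp100014_c0 > "$tmpdir/desire.txt"
printf '%s\n' comp0_c0 comp0_c0_seq1 comp100014_c0 comp7_c1 > "$tmpdir/target.txt"

# -F: treat each line of desire.txt as a fixed string, not a regular expression.
# -v: print only the lines of target.txt that match none of those strings.
# A fixed string can still match as a substring, so comp0_c0_seq1 is dropped too.
substring_result=$(grep -F -v -f "$tmpdir/desire.txt" "$tmpdir/target.txt")
echo "$substring_result"

# -x: a pattern counts as a match only if it equals the whole line,
# so comp0_c0_seq1 is kept this time.
whole_line_result=$(grep -F -x -v -f "$tmpdir/desire.txt" "$tmpdir/target.txt")
echo "$whole_line_result"

# Clean up the scratch directory.
rm -r "$tmpdir"
```

The first command prints only comp7_c1, while the second prints comp0_c0_seq1 and comp7_c1. Whether -x is appropriate depends on whether each line of target.txt holds exactly one string.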

John1024