
I am trying to filter data from data.txt using patterns stored in a file, filter.txt, like below:

grep -v -f filter.txt data.txt > op.txt

This grep takes 10-15 minutes or more with 30-40K lines in filter.txt and ~300K lines in data.txt.

Is there any way to speed this up?

data.txt

data1
data2
data3

filter.txt

data1

op.txt

data2
data3

This works with the solution provided by codeforester, but fails when filter.txt is empty.

user3150037
  • Please include sample lines from both files. You may want to take a look at this post, which has an extensive discussion on this matter: http://stackoverflow.com/questions/42239179/fastest-way-to-find-lines-of-a-text-file-from-another-larger-text-file-in-bash – codeforester Mar 09 '17 at 18:06
  • Thanks for the links. Good discussion about a similar problem. awk 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt works for matching lines, but I need inverted results. Not sure how to make it work for an inverted match. – user3150037 Mar 09 '17 at 19:26

1 Answer


Based on Inian's solution in the related post, this awk command should solve your issue:

awk 'FNR==NR {hash[$0]; next} !($0 in hash)' filter.txt data.txt > op.txt
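
How this works: while awk reads the first file, FNR==NR holds, so each line of filter.txt is stored as a key in hash and skipped; for data.txt, a line is printed only if it is not a key. Note that this compares whole lines exactly, unlike grep -f without -x, which treats each pattern as a regex that can match anywhere in a line. One caveat: if filter.txt is empty, awk reads no records from it, so FNR==NR stays true while data.txt is read and every data line is stored instead of printed. A minimal sketch that should avoid this, assuming the filter file is the first operand and has a different path from the data file, compares FILENAME against ARGV[1] instead (filter_lines is just a hypothetical wrapper name for illustration):

```shell
#!/bin/sh
# filter_lines FILTER DATA
# Print the lines of DATA that do not appear verbatim in FILTER.
# (filter_lines is a hypothetical helper name, not part of any tool.)
filter_lines() {
    # FILENAME == ARGV[1] is true only while awk is reading the first
    # file operand, so an empty FILTER cannot cause the lines of DATA
    # to be stored as keys by mistake.
    awk 'FILENAME == ARGV[1] { hash[$0]; next } !($0 in hash)' "$1" "$2"
}
```

It would be called as filter_lines filter.txt data.txt > op.txt. With an empty filter.txt the first rule never fires, so every line of data.txt falls through to the second rule and is printed.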
codeforester
  • Yup. Found it. Thanks :) – user3150037 Mar 09 '17 at 19:45
  • This command returns an empty op.txt file if filter.txt is empty, even though data.txt has lines. Ideally, it should return all records from data.txt. – user3150037 Mar 10 '17 at 16:40
  • Works correctly for me. Are there leading/trailing spaces in your files? – codeforester Mar 10 '17 at 17:05
  • The awk statement is used in a loop with some condition. Because of that condition, filter.txt is sometimes empty, and in that case I get an empty op.txt even though data.txt has data lines. Here op.txt should be equal to data.txt, as there is no pattern to match (filter.txt is empty). – user3150037 Mar 10 '17 at 17:11
  • I tried fixing but my methods don't seem to work. Hope @karakfa can help. – codeforester Mar 10 '17 at 17:40
  • Thanks for your help @codeforester. I'll post it as a separate question to make it available to a wider audience. – user3150037 Mar 10 '17 at 17:43
  • Please do. `awk 'FNR==NR {hash[$0]; n++; next} (!($0 in hash) || n == 0)' file1 file2` didn't work, by the way. Neither did `awk 'BEGIN {hash[""]=0;} FNR==NR {hash[$0]; next} !($0 in hash)' file1 file2`. – codeforester Mar 10 '17 at 17:45
  • Just for the record, the method `!($0 in hash)` does not work for me either. It has to be `(!($0 in hash))`. – George Vasiliou Mar 12 '17 at 02:34
  • `(!($0 in hash))` is the same as `!($0 in hash)` - an extra `()` doesn't make a difference. And it doesn't solve the issue of `awk` not producing any output when `filter.txt` has nothing. – codeforester Mar 12 '17 at 07:20