I need to find the records in data.txt that do not match any line in filter.txt. Earlier I used grep -vf filter.txt data.txt, which worked correctly but was very slow.

Following the discussion in "grep -vf too slow with large files", I switched to

awk 'FNR==NR {hash[$0]; next} !($0 in hash)' filter.txt data.txt > op.txt

which works if filter.txt is not empty.

data.txt

data1
data2
data3

filter.txt

data1

op.txt

data2
data3

but it fails if filter.txt is empty: the output op.txt is then also empty, when it should be identical to data.txt.

I tried ARGIND==1 instead. It seems to work for an empty filter.txt but produces wrong results for a non-empty filter.txt. The expected output is shown above.

$ cat filter.txt 
abc2
$ awk 'ARGIND==1{hash[$0]; next} !($0 in hash)' filter.txt data.txt > op.txt
$ cat op.txt 
abc2
abc1
abc2
abc3
$ vi filter.txt 
$ cat filter.txt 
$ awk 'ARGIND==1{hash[$0]; next} !($0 in hash)' filter.txt data.txt > op.txt
$ cat op.txt 
abc1
abc2
abc3
user3150037
  • You can use `ls -s` to see if a file is empty and, if it reports 0, skip that file; or, if you want an all-awk solution, check whether `NR > 2` and only process if so, or something similar (such as `awk 'END{print(NR>2)?"NOT EMPTY":"EMPTY"}'`). – Dan Mar 10 '17 at 17:55
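
The check-for-empty idea from the comment above could be sketched roughly as follows. This is only an illustration: it uses the shell's `[ -s file ]` test (true when the file exists and has a size greater than zero) rather than parsing `ls -s` output, and `filter_records` is a hypothetical helper name that does not come from the thread.

```shell
# filter_records FILTER DATA
# Print the lines of DATA that do not appear in FILTER.
# (filter_records is a hypothetical name chosen for this sketch.)
filter_records() {
    if [ -s "$1" ]; then
        # Non-empty filter: the original two-file idiom is safe here,
        # because FNR==NR is only true while reading the first file.
        awk 'FNR==NR {hash[$0]; next} !($0 in hash)' "$1" "$2"
    else
        # Empty (or missing) filter: nothing to exclude, so pass the
        # data through unchanged.
        cat "$2"
    fi
}
```

It would be called as `filter_records filter.txt data.txt > op.txt`. The trade-off is an extra branch in the shell in exchange for keeping the awk program unchanged.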

1 Answer

Change FNR==NR to ARGIND==1 if you have GNU awk, or to FILENAME==ARGV[1] otherwise.

$ awk --version | head -1
GNU Awk 4.1.4, API: 1.1 (GNU MPFR 3.1.5, GNU MP 6.1.2)

$ awk 'ARGIND==1{hash[$0]; next} !($0 in hash)' filter.txt data.txt
data2
data3

$ awk --posix 'ARGIND==1{hash[$0]; next} !($0 in hash)' filter.txt data.txt
data1
data1
data2
data3

$ awk --posix 'FILENAME==ARGV[1]{hash[$0]; next} !($0 in hash)' filter.txt data.txt
data2
data3
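
To see why FNR==NR misfires on an empty first file, it may help to print the counters for each record. The sketch below is illustrative only, and `show_counters` is a hypothetical helper name: an empty file contributes no records, so NR is still 0 when awk reaches the second file, and FNR==NR stays true for every line of it.

```shell
# show_counters FILE1 FILE2
# For every record read, print the record, NR, FNR, and the result (1 or 0)
# of the two "am I in the first file?" tests discussed above.
# (show_counters is a hypothetical name chosen for this sketch.)
show_counters() {
    awk '{
        print $0, "NR=" NR, "FNR=" FNR,
              "FNR==NR:" (FNR==NR),
              "FILENAME==ARGV[1]:" (FILENAME==ARGV[1])
    }' "$1" "$2"
}
```

With an empty filter.txt, `show_counters filter.txt data.txt` should report FNR==NR:1 for every line of data.txt, which is why each line is stored in hash and skipped, while FILENAME==ARGV[1] is 0 for those same lines. This uses only POSIX awk features, so it should behave the same under gawk and mawk.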
Ed Morton
  • Thanks for your response, but the above solution works if filter.txt is empty and produces wrong results if filter.txt is not empty. I am looking for a solution that handles both cases. – user3150037 Mar 10 '17 at 18:24
  • No, it does not produce incorrect results. Try again. – Ed Morton Mar 10 '17 at 18:24
  • I tried again but still get incorrect results. I have updated the question with the results. Please have a look. – user3150037 Mar 10 '17 at 18:33
  • As I said, "use `ARGIND==1` **if you have GNU awk**". If `ARGIND==1` isn't working for you then clearly you aren't using GNU awk. I updated the answer to show the GNU vs POSIX functionality. – Ed Morton Mar 10 '17 at 18:34
  • Yes. mawk was installed on my machine. It works with gawk. Thanks a lot. – user3150037 Mar 10 '17 at 18:44