
I have two files looking like this:

file1:

RYR2 29 70  0.376583106063  4.77084855376
MUC16 51 94 0.481067457376  3.9233164551
DCAF4L2 0 13    0.0691414496833 3.05307268261
USH2A 32 62 0.481792717087  2.81864194236
ZFHX4 14 37 0.371576262084  2.81030548752

file2:

A26B2
RYR2
MUC16
ACTL9

I need to compare them on the first column and print only those lines of the first file whose first column does not appear in the second file, so the output should be:

DCAF4L2 0 13    0.0691414496833 3.05307268261
USH2A 32 62 0.481792717087  2.81864194236
ZFHX4 14 37 0.371576262084  2.81030548752

I tried with grep:

 grep -vFxf file2 file1

with awk:

awk 'NR==FNR {exclude[$0];next} !($0 in exclude)' file 2 file1

comm:

comm -23 <(sort file1) <(sort file2)

None of these works.

Miss
  • `grep -vf file2 file1` works fine for me. Check your files for DOS line endings. – Cyrus Apr 26 '18 at 18:00
  • @Cyrus it works! thank you so much – Miss Apr 26 '18 at 18:11
  • Your awk is very close: you just want to see if the *first word* is in the array: `!($1 in exclude)` (and of course remove the space between "file" and "2" to get the right file) – glenn jackman Apr 26 '18 at 18:25
  • For your grep command, the `-x` option is the incorrect bit here: that instructs grep to compare the whole line to the patterns. – glenn jackman Apr 26 '18 at 18:30

1 Answer

You can use

grep -vFf file2 file1

grep -vf file2 file1 will work, too, but then the lines of file2 are treated as regular expressions: if they contain characters such as * or [ that should be matched literally, those characters would have to be escaped. -F makes grep treat the patterns as fixed strings, which avoids the problem.

NOTES

  • -v: Invert the match, i.e. print only the lines that do not match.
  • -f file: Take the patterns from a file, one per line.
  • -F: Interpret the pattern as a list of fixed strings (instead of regular expressions), separated by newlines, any of which is to be matched.

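So grep reads the fixed strings from file2, tests each line of file1 against them, and because of -v drops every line that contains one of them. This is enough for the sample data because only the first column contains letters; the remaining columns are purely numeric, so a gene name cannot match there. Note that this is a substring match anywhere on the line, so an entry such as RYR in file2 would also remove the RYR2 line.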
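If a substring match is too loose, an exact comparison on the first column can be sketched with awk, along the lines suggested in the comments on the question. The scratch directory and the `result` variable are only there to make the sketch self-contained; the sample data is copied from the question with single spaces between columns.

```shell
# Scratch copies of the question's sample files (single-space separated here).
tmpdir=$(mktemp -d)
printf '%s\n' \
  'RYR2 29 70 0.376583106063 4.77084855376' \
  'MUC16 51 94 0.481067457376 3.9233164551' \
  'DCAF4L2 0 13 0.0691414496833 3.05307268261' \
  'USH2A 32 62 0.481792717087 2.81864194236' \
  'ZFHX4 14 37 0.371576262084 2.81030548752' > "$tmpdir/file1"
printf '%s\n' A26B2 RYR2 MUC16 ACTL9 > "$tmpdir/file2"

# While reading file2 (NR==FNR), store each name as an array key.
# While reading file1, print the line only if its first field is not a key.
result=$(awk 'NR==FNR { exclude[$1]; next } !($1 in exclude)' \
  "$tmpdir/file2" "$tmpdir/file1")
printf '%s\n' "$result"
```

On this data it should print the DCAF4L2, USH2A and ZFHX4 lines, the same as the grep command, but it only ever compares the first field.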

Why your command did not work

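The -x option (short for --line-regexp) means: "Select only those matches that exactly match the whole line." No line of file1 consists solely of a gene name, so no pattern ever matched and -v printed every line.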
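The effect of -x can be seen on a cut-down version of the data. This is only an illustrative sketch: the two-line file1, the one-line file2 and the variable names are made up for the demonstration.

```shell
# Abbreviated, hypothetical sample: two data lines, one name to exclude.
tmpdir=$(mktemp -d)
printf '%s\n' 'RYR2 29 70' 'USH2A 32 62' > "$tmpdir/file1"
printf '%s\n' RYR2 > "$tmpdir/file2"

# With -x a pattern must equal the entire line, so "RYR2" matches nothing
# and -v keeps both lines.
with_x=$(grep -vFxf "$tmpdir/file2" "$tmpdir/file1")

# Without -x the pattern may match anywhere in the line, so the RYR2 line
# is dropped.
without_x=$(grep -vFf "$tmpdir/file2" "$tmpdir/file1")

printf 'with -x:\n%s\n\nwithout -x:\n%s\n' "$with_x" "$without_x"
```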

See the grep documentation for more about these options.
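One of the comments on the question also mentions DOS line endings. A sketch of why they matter and one way to strip them, assuming a Unix-like system; the file name file2.unix and the abbreviated sample data are hypothetical:

```shell
# Hypothetical sample: file2 saved with CRLF line endings, file1 with LF.
tmpdir=$(mktemp -d)
printf 'RYR2\r\nMUC16\r\n' > "$tmpdir/file2"
printf '%s\n' 'RYR2 29 70' 'USH2A 32 62' > "$tmpdir/file1"

# Each pattern now ends in a carriage return ("RYR2\r"), which never occurs
# in file1, so nothing is filtered out.
before=$(grep -vFf "$tmpdir/file2" "$tmpdir/file1")

# Delete the carriage returns and try again.
tr -d '\r' < "$tmpdir/file2" > "$tmpdir/file2.unix"
after=$(grep -vFf "$tmpdir/file2.unix" "$tmpdir/file1")

printf 'before:\n%s\n\nafter:\n%s\n' "$before" "$after"
```

Tools such as dos2unix do the same conversion where they are installed.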

Wiktor Stribiżew