I have the following for getting the lines matching a variable using awk:

for i in `cat file1`; do awk -v id="$i" '$0 ~ id' file2; done

How can I do the opposite? Getting the lines that DON'T match? I think I should use ! somewhere, but don't know where.

File 1 looks like this:

5NX1D
5NX3D
4NTYB

File 2 looks like this:

2R9PA IVGGYTCEENS
2RA3C RPDFCLEPPYT
6HARE YVDYKDDDDKE
4NTYB EYVDYKDDDDD

Output should look like this:

2R9PA IVGGYTCEENS
2RA3C RPDFCLEPPYT
6HARE YVDYKDDDDKE
Caterina
  • Could you please post samples of input and expected output (not my downvote, btw)? Thank you. – RavinderSingh13 May 17 '21 at 17:48
  • Thanks for edit so you want to compare 1st fields of both the files? Kindly confirm once. – RavinderSingh13 May 17 '21 at 17:51
  • Yes that's right. I want to remove from file2 the lines that contain the fields in file1 – Caterina May 17 '21 at 17:53
  • ok sure, could you please post expected output in your question, that will make it more clear, thank you. – RavinderSingh13 May 17 '21 at 17:53
  • See for example: https://stackoverflow.com/a/32747544/1435869 – karakfa May 17 '21 at 17:54
  • https://mywiki.wooledge.org/DontReadLinesWithFor – tripleee May 17 '21 at 18:00
  • What I think Ravinder tries to ask is, should we remove a line which contains a match only in the second field? Your prose description seems to say yes, but your example looks like the opposite, and usually that would make more sense in most scenarios. – tripleee May 17 '21 at 18:04
  • In other words, if the first line were `2R9PA randomrandom5NX1Drandomrandom`, should that be deleted, or kept? (You probably have much longer second fields if you are working with bioinformatics, so it's not entirely unlikely that the second field could contain a substring match somewhere within it.) – tripleee May 17 '21 at 18:06
  • It should be kept. I'm only comparing first fields in both files. The second field is meant to be a looong sequence that I want to retain in my output. And yeah I'm precisely working with bioinformatics :) – Caterina May 17 '21 at 18:08

1 Answer

This is the standard inner join, except we print if a match was not found.

awk 'NR==FNR { a[$1]; next }
    !($1 in a)' file1 file2 >newfile2

This is a very standard Awk idiom all the way, but very briefly: the line number within the current file, FNR, is equal to the overall input line number, NR, only while we are traversing the first input file, file1. In that case, we store its first field as a key in the array a, and skip to the next line. Otherwise, if we fall through, we are no longer in the first file; we print the line if its first field $1 is not a key in a.
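As a quick check, here is a minimal sketch that recreates the sample files from the question in a scratch directory and runs the command on them. The `tmpdir` and `result` variable names are only illustrative, and the output is captured in a variable instead of being redirected to newfile2.

```shell
# Recreate the sample files from the question in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' 5NX1D 5NX3D 4NTYB > "$tmpdir/file1"
printf '%s\n' \
    '2R9PA IVGGYTCEENS' \
    '2RA3C RPDFCLEPPYT' \
    '6HARE YVDYKDDDDKE' \
    '4NTYB EYVDYKDDDDD' > "$tmpdir/file2"

# While reading file1 (NR==FNR): remember each ID as a key of the array a.
# While reading file2: print the line only if its first field is not a key.
result=$(awk 'NR==FNR { a[$1]; next }
    !($1 in a)' "$tmpdir/file1" "$tmpdir/file2")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

With the sample data this should print the three lines listed as the expected output in the question; the 4NTYB line is dropped because its first field appears in file1.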

Your original script would be more idiomatic and a lot more efficient phrased similarly (just without the !), too.
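For illustration, the matching variant might look like the sketch below, again on the sample data from the question (`tmpdir` and `matched` are illustrative names). One assumption to be aware of: this compares the first field exactly, whereas the original loop does a regex match anywhere on the line, so the two are only equivalent when the IDs occur solely as the first field. It also prints matches in file2 order and reads file2 just once.

```shell
# Recreate the sample files from the question in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' 5NX1D 5NX3D 4NTYB > "$tmpdir/file1"
printf '%s\n' \
    '2R9PA IVGGYTCEENS' \
    '2RA3C RPDFCLEPPYT' \
    '6HARE YVDYKDDDDKE' \
    '4NTYB EYVDYKDDDDD' > "$tmpdir/file2"

# Same idiom without the negation: print the lines of file2 whose
# first field IS one of the IDs collected from file1.
matched=$(awk 'NR==FNR { a[$1]; next }
    $1 in a' "$tmpdir/file1" "$tmpdir/file2")

printf '%s\n' "$matched"
rm -rf "$tmpdir"
```

On the sample data, only the 4NTYB line has a first field that appears in file1, so that is the single line printed.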

tripleee