
I have a file, shown below. Note that some lines have more than two columns, that some lines are delimited by a single space and others by multiple spaces, and that the real file is quite large.

 file1.txt:
there is a line here that has more than two columns
## this line is a comment
blahblah:     blahblahSierraexample7272
foo: foo@foobar.com
nonsense:                    nonsense59s59S
nonsense:   someRandomColumn
.....

I have another file that is a subset of file1.txt. This file has exactly two columns, and the columns are separated by a single space:

file2.txt
foo: foo@foo.com
nonsense: nonsense59s59S

Now I would like to delete from file1.txt all lines that appear in file2.txt. How can I do that in a shell script? Note that file2.txt has only two columns, while file1.txt can have more, so the matching should work like this: $1 (from file2) matches $1 (from file1) and $NF (from file2) matches $NF (from file1); then invert the match and print.

P.S. I already tried grep -vf file2.txt file1.txt, but since the amount of space between column 1 and $NF is not fixed, it didn't work. sed or awk should do the trick, but I can't come up with the code. I was thinking of something like:

sed -i '/^<firstColumnOfFile2> .* <lastColumnOfFile2>$/d' file1.txt (perhaps in a while loop!)

or something like: grep -vw -f ^[(1stColofFile2)] and also [(lastColOfFile2)]$ file1.txt

eth0

2 Answers


You can use sed to turn the lines in file2.txt into regular expressions that match one or more spaces after the colon, and then use grep to remove the lines from file1.txt that match those:

$ grep -Evf <(sed 's/^\([^:]*\): /^\1:[[:space:]]+/' file2.txt) file1.txt
there is a line here that has more than two columns
## this line is a comment
blahblah:     blahblahSierraexample7272
foo: foo@foobar.com
nonsense:   someRandomColumn
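
To make the logic easier to follow, here is a minimal sketch that prints the patterns the sed step generates before grep ever sees them. The scratch directory and the `patterns` variable are assumptions made for illustration; the sed expression is the one used above.

```shell
# Hypothetical scratch copy of the file2.txt sample from the question.
tmpdir=$(mktemp -d)
printf '%s\n' 'foo: foo@foo.com' 'nonsense: nonsense59s59S' > "$tmpdir/file2.txt"

# Capture everything before the first colon, anchor it at the start of the
# line, and replace the single space after the colon with "[[:space:]]+"
# (one or more whitespace characters in an extended regular expression).
patterns=$(sed 's/^\([^:]*\): /^\1:[[:space:]]+/' "$tmpdir/file2.txt")
printf '%s\n' "$patterns"
# ^foo:[[:space:]]+foo@foo.com
# ^nonsense:[[:space:]]+nonsense59s59S

rm -r "$tmpdir"
```

grep -E then treats each generated line as a regular expression, and -v keeps only the lines of file1.txt that match none of them. The rest of each line is passed through unescaped, so a `.` in file2.txt matches any character, and the patterns are not anchored at the end of the line.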
Shawn
  • The code gives me exactly what I needed. I don't quite follow the logic, though. Could you please explain how it works? I'm just trying to learn something! :-) – eth0 May 30 '20 at 01:03
  • The code does not give you exactly what you need. To see that, add the following line to `file2.txt`: `there columns`. With this, the first line in `file1` should be discarded, but it won't be. Shawn's solution assumes (incorrectly) that the first field in `file1.txt` is always terminated by colon (the `:` punctuation mark), when your sample file shows that that's not the case. –  May 30 '20 at 01:33
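
The counterexample in the comment above can be checked with a small sketch. The scratch directory, the trimmed-down sample lines, and the `remaining` variable are assumptions made for illustration; the patterns are written to a temporary file instead of using process substitution so that the snippet also runs in a plain POSIX shell.

```shell
# Hypothetical scratch files: three lines from the question's file1.txt,
# and a file2.txt extended with the "there columns" line from the comment.
tmpdir=$(mktemp -d)
printf '%s\n' \
    'there is a line here that has more than two columns' \
    'foo: foo@foobar.com' \
    'nonsense:                    nonsense59s59S' > "$tmpdir/file1.txt"
printf '%s\n' 'nonsense: nonsense59s59S' 'there columns' > "$tmpdir/file2.txt"

# "there columns" contains no colon, so sed leaves it unchanged and grep
# looks for that literal text, which does not occur in file1.txt.
sed 's/^\([^:]*\): /^\1:[[:space:]]+/' "$tmpdir/file2.txt" > "$tmpdir/patterns.txt"
remaining=$(grep -Evf "$tmpdir/patterns.txt" "$tmpdir/file1.txt")
printf '%s\n' "$remaining"
# there is a line here that has more than two columns
# foo: foo@foobar.com

rm -r "$tmpdir"
```

The first line survives even though its first and last fields are `there` and `columns`, which is the behaviour the comment describes.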
0
$ awk 'NR==FNR{a[$0]; next} {orig=$0; $1=$1} !($0 in a){print orig}' file2 file1
there is a line here that has more than two columns
## this line is a comment
blahblah:     blahblahSierraexample7272
foo: foo@foobar.com
nonsense:   someRandomColumn
.....
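
For readers less familiar with awk, the same one-liner can be spread over several lines with comments. This is only a sketch: the scratch files, the array name `seen`, and the `kept` variable are assumptions made for illustration, and the logic is meant to be identical to the command above.

```shell
# Hypothetical scratch files with two lines from the question's samples.
tmpdir=$(mktemp -d)
printf '%s\n' \
    'nonsense:                    nonsense59s59S' \
    'nonsense:   someRandomColumn' > "$tmpdir/file1.txt"
printf '%s\n' 'nonsense: nonsense59s59S' > "$tmpdir/file2.txt"

kept=$(awk '
    # First file on the command line (file2): remember each whole line as a key.
    NR == FNR { seen[$0]; next }

    # Second file (file1): save the line, then assign a field to itself.
    # Any field assignment makes awk rebuild $0 from its fields joined by
    # single spaces, which also drops leading and trailing blanks.
    { orig = $0; $1 = $1 }

    # Print the untouched original only if the squeezed line is not a key.
    !($0 in seen) { print orig }
' "$tmpdir/file2.txt" "$tmpdir/file1.txt")
printf '%s\n' "$kept"
# nonsense:   someRandomColumn

rm -r "$tmpdir"
```

Because the comparison uses the whole squeezed line, a line of file1 with extra fields between the first and last one is never removed by this version.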
Ed Morton
  • Thanks Ed. My test run is quite happy with your code. I'm not very handy with awk, though! Could you please elaborate a little, or just point me to a good tutorial? Just trying to learn something new here :-) – eth0 May 30 '20 at 01:09
  • @eth0 - I believe Ed missed part of your problem explanation (the part that contradicted something you had said earlier). If I understand correctly, if a line in `file1.txt` has five fields, it may **still** be excluded if it matches a two-field line in `file2.txt`, as follows: the first fields match, and the **last** fields match. The line from `file1.txt` may have other fields in between, those should be ignored when deciding if there is a match. This can be fixed, for example, as follows: `awk 'NR==FNR{a[$0]; next} !($1 " " $NF in a)' file2.txt file1.txt` –  May 30 '20 at 02:08
  • @eth0 it stores each line from file2 in array `a[]`. It then removes leading/trailing blanks from each line of file1 and changes all sequences of spaces in it to a single blank, and if the resulting string does not appear in `a[]` (the contents of file2) then it prints the original line from file1. If that doesn't do what you want then edit your question to include concise, testable sample input and expected output that includes cases where this doesn't work. – Ed Morton May 30 '20 at 11:50
  • @mathguy you may be right, I took the OPs question as meaning they wanted to match lines exactly after all spaces had been compressed. I should have just insisted on sample input/output from the start - we'll see what the OP posts now. – Ed Morton May 30 '20 at 11:54
  • @EdMorton - I want to print all unique lines in file1! Now here is my special case: file2 (a subset of file1) has exactly 2 fields that are space delimited, and file1 has multiple fields with no specific delimiter pattern (look at the sample provided in the question). The thing is that file2's first field should match file1's first, and file2's second field (or `$NF`, since file2 only has two fields) should match file1's `$NF`, and if this match is found, then the line should be deleted from file1. In any event, like I said, your first code runs as expected on my data! Thanks :-) – eth0 May 30 '20 at 22:05
  • @EdMorton - one last question, if I may: which part of the command "removes leading/trailing blanks from and changes all sequences of spaces to a single blank in each line of file1"? Like I said, your code works perfectly fine, but I don't see an `$NF` match! – eth0 Jun 03 '20 at 21:23
  • @eth0 `$1=$1`, see https://www.gnu.org/software/gawk/manual/gawk.html#Changing-Fields for how assigning to a field rebuilds $0. Yeah, I don't know how it can work either given what you said in your comments after I had answered but you tell me it works and you didn't update your question to show cases where it doesn't so I moved on. – Ed Morton Jun 03 '20 at 21:25
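
The first-field/last-field rule discussed in these comments can be sketched as follows. The scratch files, the sample line with an extra middle field, the array name `seen`, and the `kept` variable are assumptions made for illustration. The key is built from `$1` and `$NF` for both files, a small variation on the `a[$0]` used in the comment above so that stray spacing in file2 does not matter.

```shell
# Hypothetical scratch files: file1 has lines with more than two fields,
# file2 has exactly two fields per line.
tmpdir=$(mktemp -d)
printf '%s\n' \
    'there is a line here that has more than two columns' \
    'foo: foo@foobar.com' \
    'nonsense:  extra  nonsense59s59S' > "$tmpdir/file1.txt"
printf '%s\n' 'there columns' 'nonsense: nonsense59s59S' > "$tmpdir/file2.txt"

kept=$(awk '
    # file2: index each line by "first field, space, last field".
    NR == FNR { seen[$1 " " $NF]; next }

    # file1: build the same key and keep the line only if it is not indexed.
    # Fields between the first and the last are ignored.
    !(($1 " " $NF) in seen)
' "$tmpdir/file2.txt" "$tmpdir/file1.txt")
printf '%s\n' "$kept"
# foo: foo@foobar.com

rm -r "$tmpdir"
```

Here both the long first line and the `nonsense:` line are dropped, because only their first and last fields are compared against file2.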