Compare all but last N Columns across two files in bash

Question

I have 2 files: one with 18 columns; another with many more. I need to find the rows that mismatch on ONLY the first 18 columns while ignoring the rest in the other file. However, I need to preserve and print the entire row (cut will not work).

File 1:

F1 F2 F3....F18
A  B  C.... Y
AA BB CC... YY

File 2:

F1 F2 F3... F18... F32
AA BB CC... YY... 123
AAA BBB CCC... YYY...321

Output Not In File 1:

AAA BBB CCC YYY...321

Output Not In File 2:

 A  B  C...Y

If possible, I would like to use diff or awk with as few loops as possible.

It would make your question clearer if you showed us a complete example using, for example, matching 3 columns out of 5. — Tom Fenech, Jun 24 '16 at 18:23
I was attempting to explain with fields. I guess assume F1 F2 F3... F18 are the first 4 columns and F32 is the 5th that ONLY exists in the second file. I only want to compare across the first four, BUT, I want to preserve that fifth column and carry it over to the output. Hope this clears things up. — JoeS., Jun 24 '16 at 18:30
[edit] your question to show **concise, testable** sample input and expected output. i.e. some input we could run a potential solution against (so get rid of all `...`s to start with) to produce the output you also show. Any other explanation/info in addition to that is nice but that's the key thing we need. — Ed Morton, Jun 24 '16 at 19:07

score 0 · Accepted Answer · answered Jun 24 '16 at 18:35

0

You can use awk:

awk '{k=""; for(i=1; i<=18; i++) k=k SUBSEP $i} FNR==NR{a[k]; next} !(k in a)' file1 file2

For each row in both files we are first creating a key by concatenating first 18 fields
We are then storing this key in an associative array while iterating first file
Finally we print each row from 2nd file when this new key value is not found in our associative array.

answered Jun 24 '16 at 18:35

anubhava

761,203
64
569
643

How do I know I am only printing the second file? Just curious. awk is still a mystery to me. – JoeS. Jun 24 '16 at 18:52
We are printing row from 2nd file only because `file2` is 2nd file in the list after `awk` command. `!(k in a)` is the action that makes awk print each row from 2nd file – anubhava Jun 24 '16 at 19:01
1

Worked as far as I can tell. Thanks! – JoeS. Jun 25 '16 at 14:25

score 0 · Answer 2 · answered Jun 24 '16 at 19:31

0

You can use grep:

grep -vf file1 file2
grep -vf <(cut -d" "  -f1-18 file2) file1

answered Jun 24 '16 at 19:31

Walter A

19,067
2
23
43

karakfa · Answer 3 · 2016-06-24T20:25:02.800

to get set differences between two files, you'll need little more, similar to @anubhava's answer

$ awk 'NR==FNR{f1[$0]; next}  
              {k=$1; for(i=2;i<=18;i++) k=k FS $i; 
               if(k in f1) delete f1[k]; 
               else f2[$0]} 
           END{print "not in f1"; 
               for(k in f2) print k; 
               print "\nnot in f2"; 
               for(k in f1) print k}' file1 file2

can be re-written to preserve order in file2

$ awk 'NR==FNR{f1[$0]; next}  
               {k=$1; for(i=2;i<=18;i++) k=k FS $i; 
                if(k in f1) delete f1[k]; 
                else {if(!p) print "not in f1";  
                      f2[$0]; print; p=1}} 
            END{print "\nnot in f2"; 
                for(k in f1) print k}' file1 file2

Compare all but last N Columns across two files in bash

3 Answers3