0

I have 2 files: one with 18 columns; another with many more. I need to find the rows that mismatch on ONLY the first 18 columns while ignoring the rest in the other file. However, I need to preserve and print the entire row (cut will not work).

File 1:

F1 F2 F3....F18
A  B  C.... Y
AA BB CC... YY

File 2:

F1 F2 F3... F18... F32
AA BB CC... YY... 123
AAA BBB CCC... YYY...321

Output Not In File 1:

AAA BBB CCC YYY...321

Output Not In File 2:

 A  B  C...Y

If possible, I would like to use diff or awk with as few loops as possible.

JoeS.
  • 73
  • 6
  • 3
    It would make your question clearer if you showed us a complete example using, for example, matching 3 columns out of 5. – Tom Fenech Jun 24 '16 at 18:23
  • I was attempting to explain with fields. I guess assume F1 F2 F3... F18 are the first 4 columns and F32 is the 5th that ONLY exists in the second file. I only want to compare across the first four, BUT, I want to preserve that fifth column and carry it over to the output. Hope this clears things up. – JoeS. Jun 24 '16 at 18:30
  • 1
    [edit] your question to show **concise, testable** sample input and expected output. i.e. some input we could run a potential solution against (so get rid of all `...`s to start with) to produce the output you also show. Any other explanation/info in addition to that is nice but that's the key thing we need. – Ed Morton Jun 24 '16 at 19:07

3 Answers3

0

You can use awk:

awk '{k=""; for(i=1; i<=18; i++) k=k SUBSEP $i} FNR==NR{a[k]; next} !(k in a)' file1 file2
  • For each row in both files we are first creating a key by concatenating first 18 fields
  • We are then storing this key in an associative array while iterating first file
  • Finally we print each row from 2nd file when this new key value is not found in our associative array.
anubhava
  • 761,203
  • 64
  • 569
  • 643
  • How do I know I am only printing the second file? Just curious. awk is still a mystery to me. – JoeS. Jun 24 '16 at 18:52
  • We are printing row from 2nd file only because `file2` is 2nd file in the list after `awk` command. `!(k in a)` is the action that makes awk print each row from 2nd file – anubhava Jun 24 '16 at 19:01
  • 1
    Worked as far as I can tell. Thanks! – JoeS. Jun 25 '16 at 14:25
0

You can use grep:

grep -vf file1 file2
grep -vf <(cut -d" "  -f1-18 file2) file1
Walter A
  • 19,067
  • 2
  • 23
  • 43
0

to get set differences between two files, you'll need little more, similar to @anubhava's answer

$ awk 'NR==FNR{f1[$0]; next}  
              {k=$1; for(i=2;i<=18;i++) k=k FS $i; 
               if(k in f1) delete f1[k]; 
               else f2[$0]} 
           END{print "not in f1"; 
               for(k in f2) print k; 
               print "\nnot in f2"; 
               for(k in f1) print k}' file1 file2

can be re-written to preserve order in file2

$ awk 'NR==FNR{f1[$0]; next}  
               {k=$1; for(i=2;i<=18;i++) k=k FS $i; 
                if(k in f1) delete f1[k]; 
                else {if(!p) print "not in f1";  
                      f2[$0]; print; p=1}} 
            END{print "\nnot in f2"; 
                for(k in f1) print k}' file1 file2
karakfa
  • 66,216
  • 7
  • 41
  • 56