comparing 2 files with a searchable key

Question

Good day,

I have been trying to find a solution to a similar problem asked in comparing 2 files using awk but I can't seem to get my head around it. Looking for some kind assistance.

I have 2 files that I would like to compare. Mock-up contents of file1 and file2 are shown below:

file1:

50       0004312805201        06740         2310821                                                                                                                                
50      0004986504201        00845         2310837                                                                                                                                
50      0003913155201        47679         2310762                                                                                                                                
50      0004997395201        2035          2311180                                                                                                                                
50      0001147242201        15000         23108723                                                                                                                                
50      0005771878201        13545         I3840000

file2:

0003913155 A

0005771878 A

0004312805 A

0000000015 B

0000000012 B

1111111111 E

I need to perform a substring on field2 from file1 to produce a 10 character length searchable key value and find the matching value in field1 of file2.

If a match is found, print out the entire file1 row with field2 from file2 appended as a new field.

If no match, print out the entire file1 row with the string "NO" appended as a new field. Output would be best redirected to a file.

Example output is shown below.

Output:

50       0004312805201        06740         2310821 A                                                                                                                               
50      0004986504201        00845         2310837 NO                                                                                                                             
50      0003913155201        47679         2310762 A                                                                                                                               
50      0004997395201        2035          2311180 NO                                                                                                                             
50      0001147242201        15000         23108723 NO                                                                                                                               
50      0005771878201        13545         I3840000 A

How would you guys suggest I tackle this via awk or GNU-awk? Having issues in preparing the searchable key substring and using it in awk/GNU-awk to build the array.

Any help would be greatly appreciated. I am spinning my wheels at this point.

Thanks.

*to produce a 10 character length searchable key* - could that key NOT start from the beginning of the field? — RomanPerekhrest, Sep 16 '17 at 07:38
Possible duplicate of [using awk to match a column in log file and print the entire line](https://stackoverflow.com/questions/22076186/using-awk-to-match-a-column-in-log-file-and-print-the-entire-line) — tk421, Sep 16 '17 at 07:58

Akshay Hegde · Answer 1 · 2017-09-16T07:20:50.167

awk '
     FNR==NR{ a[$1]=$2; next }
     { s=substr($2,1,10); print $0,(s in a ?a[s]:"No") }
    ' file2 file1 > your_output_file

Input:

$ cat file1
50 0004312805201 06740 2310821
50 0004986504201 00845 2310837
50 0003913155201 47679 2310762
50 0004997395201 2035 2311180
50 0001147242201 15000 23108723
50 0005771878201 13545 I3840000 

$ cat file2
0003913155 A
0005771878 A
0004312805 A
0000000015 B
0000000012 B
1111111111 E

Output

$ awk 'FNR==NR{a[$1]=$2;next}{s=substr($2,1,10);print $0, (s in a ? a[s] : "No") }' file2 file1
50 0004312805201 06740 2310821 A
50 0004986504201 00845 2310837 No
50 0003913155201 47679 2310762 A
50 0004997395201 2035 2311180 No
50 0001147242201 15000 23108723 No
50 0005771878201 13545 I3840000  A

I will review the feedback provided tomorrow and reply back with comments. Thanks to all that responded. — jadeabc, Sep 16 '17 at 08:44

Marc Lambrichs · Answer 2 · 2017-09-16T16:56:34.030

Not sure what OP means with produce a 10 character length searchable key value. I interpret it as: the value in field 1 of file2 must be a substring of field 2 in file1.

$ cat tst.awk
/^[0-9]/ && NR==FNR { a[$1]=$2; next }   # read values from file2 in array
/^[0-9]/{
   f=0;
   for (i in a){                         # loop over field 1 of file2
      if (index($2, i)){                 # if i can be found in field 2 of file1
         print $0, a[i];                 # print $0 with $2 from file2
         f++;
         break;
      }
   }
}
/^[0-9]/ && !f{ print $0, "NO" }         # if no match, print "NO" line

With input of

$ cat file1
50 0004312805201 06740 2310821
50 0004986504201 00845 2310837
50 0003913155201 47679 2310762
50 0004997395201 2035 2311180
50 0001147242201 15000 23108723
50 0005771878201 13545 I3840000

and

$ cat file2
0003913155 A

0005771878 A

0004312805 A

0000000015 B

0000000012 B

1111111111 E

calling tst.awk will generate output:

$ awk -f tst.awk file2 file1
50 0004312805201 06740 2310821 A
50 0004986504201 00845 2310837 NO
50 0003913155201 47679 2310762 A
50 0004997395201 2035 2311180 NO
50 0001147242201 15000 23108723 NO
50 0005771878201 13545 I3840000 A

Or, use a oneliner:

$ awk '/^[0-9]/ && NR==FNR { a[$1]=$2; next } /^[0-9]/{f=0;for (i in a){if (index($2, i)){print $0, a[i];f++;break;}}}/^[0-9]/ && !f{ print $0, "NO" }' file2 file1

comparing 2 files with a searchable key

2 Answers2