
I have a two part awk problem:

In the first part: I want to compare the second columns of two files. If there is a match, print the corresponding value to an output file.

In the second part: I also need the opposite information. Again, I want to compare the second columns of the same two files, but this time print the unique values, i.e. those that appear in Column 2 of one file but not in Column 2 of the other.

To solve the first part, I have used the following awk command, found here:

awk 'NR==FNR { a[$1]=$2; next} $1 in a {print $0, a[$1]}' File2 File1

This seems to solve the issue of identifying the matching values.

However, I cannot seem to find a solution for identifying the information unique to File 1 and printing it to a third output file. Can anyone provide insight on how to solve this?

An example of the input is the following:

File 1

A   concept1    123
A   concept2    123
A   concept1    123
A   concept1    123
A   concept3    123

File 2

B   concept1    456
B   concept4    456
B   concept5    456
B   concept1    456
B   concept3    456

OUTPUT File 3

concept4
concept5

Thank you.

UPDATE: In the original question, I compared one file against one other file. Is it possible to modify this code to compare one file against multiple other files?

For instance:

Input: FILE1, to be compared for any unique line against FILE2, FILE3, FILE4...FILEn. Output: a file with all unique lines from FILE1.

owwoow14

2 Answers


IIUC you are going about it in the wrong way: you are using $1 as the index into the array, and $1 is the same for every record in File 1.

Small input files

One approach to your problem is to save the second column into a and check it against the second file. Something like this:

awk 'NR==FNR { a[FNR]=$2; next} $2 != a[FNR] { print $2 }' File1 File2

Output:

concept4
concept5

Large input files

The above approach will use a lot of memory if the input files are very large. In that case a better way would be to preprocess the input like so:

paste <( <File1 tr -s ' ' | cut -d' ' -f2) \
      <( <File2 tr -s ' ' | cut -d' ' -f2) | 
  awk '$1 != $2 { print $2 }'

Output:

concept4
concept5
Thor

Given your posted sample input files:

$ awk 'NR==FNR{seen[$2]++;next} seen[$2]{print $2}' file1 file2
concept1
concept1
concept3

$ awk 'NR==FNR{seen[$2]++;next} !seen[$2]{print $2}' file1 file2
concept4
concept5

$ awk 'NR==FNR{seen[$2]++;next} !seen[$2]{print $2}' file2 file1
concept2
Ed Morton
  • Could this potentially work with more than one file? Re: updated question. Meaning, the input (as is now) compares the contents of Col2 in File1 against those in File2 and prints the lines that appear in File1 and not in File2 to a third output File3. Is it possible to add more files to the input to achieve the same result, i.e. input File1 (to be compared) against File2, File3, File4...FileN, and output a file with the lines that appear only in File1 and not in any of the other N files? – owwoow14 Oct 15 '13 at 09:51
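Following up on the update and the comment above, the same seen[] idea can be extended to multiple reference files (a sketch, not part of the original answers; the file names and contents below are made up for illustration). The trick is to fill the array from every file except the last argument, then filter the last file against it:

```shell
# Hypothetical sample files in the shape of the question's input.
printf 'A concept1 123\nA concept2 123\nA concept3 123\n' > file1
printf 'B concept1 456\n' > file2
printf 'B concept3 456\n' > file3

# Every file except the last argument fills seen[]; the last file
# (File 1) is then filtered against it. The trailing ++ in the
# second block also suppresses duplicate output lines.
awk 'FILENAME != ARGV[ARGC-1] { seen[$2]++; next }
     !seen[$2]++ { print $2 }' file2 file3 file1
# prints: concept2
```

One caveat: the FILENAME comparison assumes the last file's name differs from every reference file's name; with gawk you could test ARGIND < ARGC-1 instead.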