
I have a two part awk problem:

In the first part: I want to compare the second columns of two files. If there is a match, print the corresponding value to an output file.

In the second part: I also need the opposite information. Again, I want to compare the second columns of the same two files, but this time print the unique values, i.e. those that appear in Column 2 of one file but not in Column 2 of the other.

To solve the first part, I have used the following awk command, found here:

awk 'NR==FNR { a[$1]=$2; next} $1 in a {print $0, a[$1]}' File2 File1

This seems to solve the issue of identifying the matching values.

However, I cannot seem to find a solution for identifying the information unique to File 1 and printing it to a third output file. Can anyone provide insight on how to solve this?

An example of the input is the following:

File 1

A   concept1    123
A   concept2    123
A   concept1    123
A   concept1    123
A   concept3    123

File 2

B   concept1    456
B   concept4    456
B   concept5    456
B   concept1    456
B   concept3    456

OUTPUT File 3

concept4
concept5

Thank you.

UPDATE: In the original question, I compared one file against one other file. Is it possible to modify this code to compare one file against multiple other files?

For instance:

Input: FILE1, to be compared for any unique line against FILE2, FILE3, FILE4...FILEn. Output: a file with all unique lines from FILE1.

owwoow14

2 Answers


IIUC you are going about it in the wrong way: you are using $1 as the index into the array, and $1 is the same for every record in File 1.

Small input files

One approach to your problem is to save the second column into a and check it against the second file. Something like this:

awk 'NR==FNR { a[FNR]=$2; next} $2 != a[FNR] { print $2 }' File1 File2

Output:

concept4
concept5

Large input files

The above approach will use a lot of memory if the input files are very large. In that case a better way would be to preprocess the input like so:

paste <( <File1 tr -s ' ' | cut -d' ' -f2) \
      <( <File2 tr -s ' ' | cut -d' ' -f2) | 
  awk '$1 != $2 { print $2 }'

Output:

concept4
concept5
Thor

Given your posted sample input files:

$ awk 'NR==FNR{seen[$2]++;next} seen[$2]{print $2}' file1 file2
concept1
concept1
concept3

$ awk 'NR==FNR{seen[$2]++;next} !seen[$2]{print $2}' file1 file2
concept4
concept5

$ awk 'NR==FNR{seen[$2]++;next} !seen[$2]{print $2}' file2 file1
concept2
Ed Morton
  • Could this potentially work with more than one file? Re: updated question. Meaning, the input (as is now) compares the contents of Col2 in File1 against those in File2 and prints the lines that appear in File1 and not in File2 to a third output File3. Is it possible to add more files to the input to achieve the same result, i.e. input File1 (to be compared) against File2, File3, File4...FileN, and output a file with the lines that appear only in File1 and not in any of the other N files? – owwoow14 Oct 15 '13 at 09:51
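Following up on the update and the comment above, the same seen[] idea can be extended to multiple reference files (a sketch, not part of the original answers; the file names and contents below are made up for illustration). The trick is to fill the array from every file except the last argument, then filter the last file against it:

```shell
# Hypothetical sample files in the shape of the question's input.
printf 'A concept1 123\nA concept2 123\nA concept3 123\n' > file1
printf 'B concept1 456\n' > file2
printf 'B concept3 456\n' > file3

# Every file except the last argument fills seen[]; the last file
# (File 1) is then filtered against it. The trailing ++ in the
# second block also suppresses duplicate output lines.
awk 'FILENAME != ARGV[ARGC-1] { seen[$2]++; next }
     !seen[$2]++ { print $2 }' file2 file3 file1
# prints: concept2
```

One caveat: the FILENAME comparison assumes the last file's name differs from every reference file's name; with gawk you could test ARGIND < ARGC-1 instead.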