3

I have been searching in the forum a solution for my problem but unfortunately I was not successful. I am quite green in bash; therefore I would really appreciate any suggestion ot link to previous question/discussions

I have two files:

file1

    Ada     ada
    Ada     ada
    Ada     aidB
    Ada     alkA
    Ada     alkB
    Ada     alkB
    AdiY    adiA
    AdiY    gadA

file2

    NP_414557.1     insL1
    NP_414559.1     mokC
    YP_025292.1     hokC
    NP_414560.1     nhaA
    NP_414561.1     nhaR
    NP_414562.1     insB1
    NP_414563.1     insA
    NP_414564.1     rpsT

I would like to compare file1 and file2 based on the second column. If the object is the same I would like to append the text in the column 1 in file2 relative to that specific shared entry in a third column in file1.

Expected output: file1

    PhoB  caiT  NP_414580.1     
    PhoP  caiE  NP_414581.1     
    PhoP  caiB  NP_414582.1 

Please consider the text reported just as example.

I am struggling with awk trying to adjust the code from a previouse question:

Compare two files of different columns and print different columns

awk 'NR==FNR{a[$1]=$2 ; next }$2 in a {$1=$1 FS a[$2]print}' file1 file2

Thank you very much in advance for your time and your help.

Best.

Community
  • 1
  • 1
efrem
  • 455
  • 9
  • 23
  • 2
    Your sample input looks a bit truncated. Where is `caiT` coming from? – fedorqui May 07 '14 at 10:55
  • Your `file1` and `file2` appear to have absolutely nothing in common, and your proposed output has nothing in common with either of them. Consider posting some example data that at least has a minimal chance of demonstrating what you are asking about... – twalberg May 07 '14 at 18:39
  • Hi, thank you for your answer. As I wrote the data I reported are just example. The second column *file1*, *file2*, and the output have the same kind of information. Just imagine that the PhoB and caiT will be present in *file1* and caiT and NP_414508.1 are present in *file2*. I hope this makes it clearer. – efrem May 08 '14 at 13:05
  • It would be great if you could explain the code. Thanks. – efrem May 13 '14 at 14:36

2 Answers2

3

As @fedorqui commented, your example inputs/output are not consistent. I think this should do the trick though:

awk 'NR==FNR{a[$2]=$0; next} a[$2]>0{print a[$2],$1}' file1 file2

file1:

A alice
B bob
C carol
D dan

file2:

1 dan
2 alice
3 carol
4 bob

Output:

$ awk 'NR==FNR{a[$2]=$0} NR>FNR && a[$2]>0{print a[$2],$1}' file1 file2
D dan 1
A alice 2
C carol 3
B bob 4

Output can be sorted by whatever column you choose using sort -k. Breaking down the awk code:

  • NR==FNR{a[$2]=$0; next} - NR is an awk variable which contains the total number of lines processed so far. FNR is similar, but only contains the number of processes lines of the current file, so this condition effectively means "only do this for the first input file". The associated action stores the entire line ($0) in the associative array a, with the index being the value of the second field of the row. next just means that awk should move to the next line without doing further processing.

  • a[$2]>0{print a[$2],$1} - this condition will only be tested for the second and subsequent input files. The second field is used as an index to look up a value from the a - if the value is greater than 0 (ie, the value is a string in this case) then the value is printed, followed by the first field of the current line.

Basically, every line of the first file is stored in an array, indexed by the second field of the line. If that second field is matched in the second file, then the whole line from the first file is printed, with the second field from the second file appended.

Josh Jolly
  • 11,258
  • 2
  • 39
  • 55
  • Hi Josh thank you so much for your answer! Sorry if my example was not consistent. Anyway you got it perfectly right. – efrem May 09 '14 at 18:38
  • I tried what you suggested, simply copying and pasting your code after changing the file names (actually you suggested two different codes, I tried both). Unfortunately I did not have any output. Any idea Why? Thank you again! – efrem May 09 '14 at 18:57
  • Dear Josh, you were perfectly right!! Your script does the job! I created two test files and I used them obtaining a correct output. However, using my real files the script does not work. I controlled my files and they are tab delimited. Though, apparently there is something wrong with the formatting. Any idea how I could check that? Thank you again for your help! – efrem May 13 '14 at 14:28
  • See http://stackoverflow.com/questions/5374239/tab-separated-values-in-awk for help using `awk` with tab-separated files. It would also be worth making sure that your files have Unix line endings rather than DOS line endings - try `dos2unix` on them. – Josh Jolly May 13 '14 at 15:12
  • OK! I corrected the files and it seems that it worked perfectly! Thank you so much! Just to understand a bit more about awk, could you please write a small explanation of the code?? Thanks again! – efrem May 13 '14 at 17:17
  • I have edited my post with an explanation - let me know if this is enough. Consider accepting the answer if it is helpful. :) – Josh Jolly May 13 '14 at 17:27
1

You can use join command to link fields on files. The simplest way to use it can be:

 join -j 2

to obtain an output with the first and second file of first file and following the first field of second file you can use the -o option as follow:

 join -j 2 -o 1.1 1.2 2.1

it is required that the lines in both files are sorted on second field, if the lines are not correctly sorted you can use the following bash trick:

 join -j 2  -o 1.1 1.2 2.1 <(sort -k2 1) <(sort -k2 2)

Supposing this is the content of two files:

 $ cat 1
 FIRST   first
 SECOND  second
 THIRD   third
 FOURTH  fourth
 $ cat 2
 2       second
 1       first
 3       third
 4       fourth

Note: the field separator on each line is the TAB character

The final result:

 $ join -j 2 -o 1.1 1.2 2.1 <(sort 1) <(sort -k2 2)
 FIRST first 1
 FOURTH fourth 4
 SECOND second 2
 THIRD third 3