1

I am trying to combine data from two different files. In each file, some data is linked to some ID. I want to 'combine' both files in the sense that all ID's must be printed to a new file, and data from both files must be correctly matched to the ID. Example:

cat file_1
1.01    data_a
1.02    data_b
1.03    data_c
1.04    data_d
1.05    data_e
1.06    data_f
cat file_2
1.01    data_aa
1.03    data_cc
1.05    data_ee
1.09    data_ii

The desired result is:

cat files_combined
1.01    data_a    data_aa
1.02    data_b
1.03    data_c    data_cc
1.04    data_d    
1.05    data_e    data_ee
1.06    data_f
1.09              data_ii

I know how to do it the long, slow way through looping over each ID. Somewhat pseudocode example:

awk -F\\t '{print $1}' file_1 > files_combined
awk -F\\t '{print $1}' file_2 >> files_combined
sort -u -n files_combined > tmp && mv tmp files_combined

count=0
while read line; do
    count++
    ID=$line
    value1=$(grep "$ID" file_1 | awk -F\\t '{print $2}')
    value2=$(grep "$ID" file_2 | awk -F\\t '{print $2}')
    awk -F\\t 'NR=='$count' {$2='$value1' && $3='$value2'} 1' OFS="\t" files_combined > tmp && mv tmp files_combined
done < files_combined

This does the job for a file with 10 lines, but with 100000 lines it simply takes too long. I'm just looking for that magic awk solution that is there without a doubt.

Solution provided by bob dylan:

join -j -a 1 -a 2 -t $'\t' -o auto file_1 file_2

BorisRu
  • 227
  • 1
  • 8

2 Answers2

6

Does it have to be awk, or did you choose this because you think that's the best - easiest way?

You can do this via join

$join -j 1 -a 1 -a 2 -o auto file_1 file_2 | column -t -s' ' -o' '
1.01 data_a data_aa
1.02 data_b
1.03 data_c data_cc
1.04 data_d
1.05 data_e data_ee
1.06 data_f
1.09        data_ii

edit: As per the excellent suggestion from KamilCuk you can preserve the output afterwards.

bob dylan
  • 1,458
  • 1
  • 14
  • 32
  • 3
    and you could format the output even as the OP wants by pipeing to column like `| column -t -s' ' -o' '` – KamilCuk Dec 12 '19 at 16:11
  • Damn, I didn't know you could use `-a 1 -a 2`, I always thought you could only use one of the two. – Aaron Dec 12 '19 at 16:12
  • @KamilCuk excellent suggestion. I didn't bother sorting the output for them, but I've edited my post to include this also. – bob dylan Dec 12 '19 at 16:15
  • This works perfectly! Indeed, it didn't need to be with awk I just thought it would be easier to use. Thanks! – BorisRu Dec 12 '19 at 16:23
  • 1
    Just wanted to add to this, my input files are tab seperated and output has to be tab seperated as well, so what I did to accomplish this is add ```-t $'\t'``` to the join statement. Then you also do not need the ```|column .....``` part! – BorisRu Dec 13 '19 at 09:33
  • You can join on -t instead of converting it with sed. https://stackoverflow.com/questions/1722353/unix-join-separator-char – bob dylan Dec 13 '19 at 09:36
  • For example: ```join '-t ' -j 1 -a 1 -a 2 -o auto file_1 file_2``` you don't need to pipe the output. Note on this the -t is done by pressing ctrl+v and then tab edit: or even ```join -t $'\t' -j 1 -a 1 -a 2 file_1 file_2 -o auto``` – bob dylan Dec 13 '19 at 09:45
  • Bah I can see you've edited your comment. Never mind at least you got there in the end. – bob dylan Dec 13 '19 at 09:48
2

1st Solution: In case you do have duplicate values of $1 in your Input_file(s) then following will take care of that case also.

awk '
BEGIN{
  OFS="\t"
}
FNR==NR{
  a[$1]=$2
  next
}
$1 in a{
  print $1,a[$1],$2
  c[$1]
  next
}
{
  b[$1]=$2
}
END{
  for(i in a){
    if(!(i in c)){
      print i,a[i],"\t"
    }
  }
  for(j in b){
    print j,"\t",b[j]
  }
}
' Input_file2  Input_file1


2nd solution: Could you please try following in case you are NOT worried about order of output. You need not to run these many commands, you could simply pass your Input_files to this code.

awk '
BEGIN{
  OFS="\t"
}
FNR==NR{
  a[$1]=$2
  next
}
$1 in a{
  print $1,a[$1],$2
  delete a[$1]
  next
}
{
  b[$1]=$2
}
END{
  for(i in a){
    print i,a[i],"\t"
  }
  for(j in b){
    print j,"\t",b[j]
  }
}
' file2 file1
RavinderSingh13
  • 130,504
  • 14
  • 57
  • 93