I'm trying to cross two files using their location.
f1:

Location    Consequence SYMBOL  Feature gnomAD_AF   gnomAD_AFR_AF   gnomAD_AMR_AF   gnomAD_ASJ_AF   gnomAD_EAS_AF   gnomAD_FIN_AF   gnomAD_NFE_AF   gnomAD_OTH_AF   gnomAD_SAS_AF   CLIN_SIG    CADD_phred  CADD_raw    CADD_raw_rankscore  REVEL_rankscore REVEL_score clinvar_clnsig  clinvar_golden_stars
1:45330550-45330550 missense_variant    MUTYH   NM_001128425.1  2.541e-05   0   0   0   5.945e-05   0   2.818e-05   0   6.821e-05   uncertain_significance  23.7    4.061544    0.54541 0.76110 0.461   -   -
1:45331556-45331556 missense_variant,splice_region_variant  MUTYH   NM_001128425.1  0.002958    0.0007277   0.003068    0.0002038   0   0.002182    0.004831    0.003839    9.747e-05   likely_pathogenic,pathogenic    29.4    6.349794    0.87691 0.99202 0.954   5,5,5,5,5,5,5   2,0,2,2,0,0,0

f2:

chromosome  start   stop    ref alt
12  132668439   132668439   G   A
17  7673593 7673593 G   C

I managed to do it using this:

awk -v OFS="\t" 'NR==1{h1=$0}NR==FNR{arr[$1":"$2"-"$3] = $0; next}FNR==1{print h1, $0}NR>FNR{if($1 in arr){print arr[$1], $0}}' f2 f1 > res

However, I get a newline in the middle of every output line, just after h1 or arr[$1] is printed, and I don't understand why.

chromosome  start   stop    ref alt
    Location    Consequence SYMBOL  Feature gnomAD_AF   gnomAD_AFR_AF   gnomAD_AMR_AF   gnomAD_ASJ_AF   gnomAD_EAS_AF   gnomAD_FIN_AF   gnomAD_NFE_AF   gnomAD_OTH_AF   gnomAD_SAS_AF   CLIN_SIG    CADD_phred  CADD_raw    CADD_raw_rankscore  REVEL_rankscore REVEL_score clinvar_clnsig  clinvar_golden_stars
1   45330550    45330550    C   T
    1:45330550-45330550 missense_variant    MUTYH   NM_001128425.1  2.541e-05   0   0   0   5.945e-05   0   2.818e-05   0   6.821e-05   uncertain_significance  23.7    4.061544    0.54541 0.76110 0.461   -   -
1   45331556    45331556    C   T

I have even tried using individual variables to print h1 but the problem still persisted.

Any insights?

Yujin Kim

1 Answer

I think we are missing a couple of next statements. Hopefully the formatting of the repaired code below also makes it easier to follow:

awk '
    BEGIN       { OFS = "\t"; h1 = ""; split("", arr) }
                { $1 = $1 }
    NR  == 1    { h1 = $0;                next }
    FNR == 1    { print h1, $0;           next }
    NR  == FNR  { arr[$1":"$2"-"$3] = $0; next }
    ($1 in arr) { print arr[$1], $0 }
    ' f2 f1 > res

If the input is tab-delimited and we want FS = OFS = "\t", we can set that in the BEGIN section and drop the { $1 = $1 } rule, which is only there to rebuild each record with tab-delimited output.
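A minimal sketch of that FS = OFS = "\t" variant follows. It assumes both files really are tab-delimited, and the two sample files it writes are hypothetical, trimmed-down stand-ins for f2 and f1 (only three columns of f1 are kept):

```shell
# Hypothetical tab-delimited stand-ins for f2 and f1 (f1 trimmed to 3 columns)
printf 'chromosome\tstart\tstop\tref\talt\n1\t45330550\t45330550\tC\tT\n17\t7673593\t7673593\tG\tC\n' > f2
printf 'Location\tConsequence\tSYMBOL\n1:45330550-45330550\tmissense_variant\tMUTYH\n1:45331556-45331556\tmissense_variant\tMUTYH\n' > f1

awk '
    BEGIN       { FS = OFS = "\t" }                 # tab in, tab out: no $1 = $1 rule needed
    NR  == 1    { h1 = $0;                next }    # header line of f2
    FNR == 1    { print h1, $0;           next }    # header line of f1: print the joined header
    NR  == FNR  { arr[$1":"$2"-"$3] = $0; next }    # f2 rows keyed as chromosome:start-stop
    ($1 in arr) { print arr[$1], $0 }               # f1 rows whose Location matches a key
    ' f2 f1 > res
```

With these sample files, res should hold the joined header plus one joined row for 1:45330550-45330550; the other rows have no partner in the opposite file and are dropped.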

Michael Back