
I have searched everywhere but I still don't have the answer that I'm looking for. I have the following pdb file (file1):

ATOM      1  N   SER A   1      31.848  -5.217  38.114  1.00 39.55
ATOM      2  CA  SER A   1      31.668  -5.130  36.630  1.00 40.83
ATOM      3  C   SER A   1      30.991  -3.833  36.183  1.00 40.24
ATOM      4  O   SER A   1      30.868  -2.883  36.961  1.00 40.08
ATOM      5  CB  SER A   1      30.854  -6.329  36.118  1.00 41.46
ATOM      6  OG  SER A   1      31.600  -7.531  36.190  1.00 44.54
ATOM      7  N   THR A   2      30.605  -3.796  34.906  1.00 39.92
ATOM      8  CA  THR A   2      29.920  -2.658  34.286  1.00 38.97
ATOM      9  C   THR A   2      28.542  -3.116  33.777  1.00 38.40
ATOM     10  O   THR A   2      27.815  -2.341  33.141  1.00 38.79
ATOM     11  CB  THR A   2      30.734  -2.067  33.086  1.00 39.67
ATOM     12  OG1 THR A   2      31.045  -3.101  32.139  1.00 38.83
ATOM     13  CG2 THR A   2      32.020  -1.403  33.566  1.00 38.83

I also have the following file, produced by a calculation using gfortran (file2):

1  0.14364205034979632
2  0.50527753403393372

What I'd like to do is replace column 11 of file1 with column 2 of file2 wherever column 6 of file1 equals column 1 of file2. The output should look like this:

ATOM      1  N   SER A   1      31.848  -5.217  38.114  1.00 0.14364205034979632
ATOM      2  CA  SER A   1      31.668  -5.130  36.630  1.00 0.14364205034979632
ATOM      3  C   SER A   1      30.991  -3.833  36.183  1.00 0.14364205034979632
ATOM      4  O   SER A   1      30.868  -2.883  36.961  1.00 0.14364205034979632
ATOM      5  CB  SER A   1      30.854  -6.329  36.118  1.00 0.14364205034979632
ATOM      6  OG  SER A   1      31.600  -7.531  36.190  1.00 0.14364205034979632
ATOM      7  N   THR A   2      30.605  -3.796  34.906  1.00 0.50527753403393372
ATOM      8  CA  THR A   2      29.920  -2.658  34.286  1.00 0.50527753403393372
ATOM      9  C   THR A   2      28.542  -3.116  33.777  1.00 0.50527753403393372
ATOM     10  O   THR A   2      27.815  -2.341  33.141  1.00 0.50527753403393372
ATOM     11  CB  THR A   2      30.734  -2.067  33.086  1.00 0.50527753403393372
ATOM     12  OG1 THR A   2      31.045  -3.101  32.139  1.00 0.50527753403393372
ATOM     13  CG2 THR A   2      32.020  -1.403  33.566  1.00 0.50527753403393372

I have the following code:

gawk '
FNR==NR { pdb[NR]=$0; next }
{
    split(pdb[FNR],flds,FS,seps)

    while ( flds[6]==$1 ) {
    flds[11]=$2
    for (i=1;i in flds;i++)
        printf "%s%s", flds[i], seps[i]
    print ""
    }
}
' "file1" "file2" > "output.pdb"

It gets the job done for the first line of file1 and keeps the spacing consistent. The problem is that it never proceeds to the next lines, and the first line is repeated endlessly. Could anyone be so kind as to help me out?

Thanks! I'd treat you to a beer :)

  • You are quite confused about how awk works and its syntax. I recommend the book Effective Awk Programming, 4th Edition, by Arnold Robbins to start learning about awk. – Ed Morton Jul 17 '16 at 14:33
  • @EdMorton: I'm actually new to using awk and I just got that piece of code from the internet. If I had the time, I'd love to check out your suggested reference. Cheers!~ – ajthealchemist Jul 17 '16 at 15:51
  • There's **FAR** more bad code than good code on the internet. Make sure you check out the source before trying to use any code you find online as it'll mostly be buggy at best, and often dangerous. SO is littered with accepted "answers" that will wipe your file system or similar given some inputs. Or it might be good for one application but bad for yours. And, of course, you should always read a recommended/trusted source first to get a basic understanding of any tool/language you plan to use so you stand some chance of separating the good code from the bad. – Ed Morton Jul 17 '16 at 15:55
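
For readers new to the `FNR==NR` idiom that both the question's script and the answers rely on, here is a minimal sketch of why the script in the question stalls. It assumes two tiny stand-in files (`keys.txt` and `rows.txt` are hypothetical names, created in a temporary directory). awk already runs the main block once per input line, and nothing inside `while ( flds[6]==$1 )` changes either side of the comparison, so once the condition is true it stays true. A plain per-line condition is enough:

```shell
#!/bin/sh
# Hypothetical demo files; the names and contents are illustrative only.
tmp=$(mktemp -d)
printf '1 0.14\n2 0.51\n' > "$tmp/keys.txt"
printf 'a 1\nb 1\nc 2\n'  > "$tmp/rows.txt"

# NR counts lines across all files, FNR restarts at 1 for each file,
# so FNR==NR holds only while the first file on the command line is read.
demo_output=$(awk '
FNR == NR   { val[$1] = $2; next }   # first file: build the lookup table
($2 in val) { print $1, val[$2] }    # second file: one test per line, no loop
' "$tmp/keys.txt" "$tmp/rows.txt")

printf '%s\n' "$demo_output"
rm -rf "$tmp"
```

Note that the lookup file is passed first here, whereas the question's script reads file1 first and indexes it by line number, which ties each row of file2 to a single line of file1.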

3 Answers

1

I assume that file1 is sorted by column 6.

join -1 6 -2 1 file1 file2 -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,1.10,2.2 | column -t

Output:

ATOM  1   N    SER  A  1  31.848  -5.217  38.114  1.00  0.14364205034979632
ATOM  2   CA   SER  A  1  31.668  -5.130  36.630  1.00  0.14364205034979632
ATOM  3   C    SER  A  1  30.991  -3.833  36.183  1.00  0.14364205034979632
ATOM  4   O    SER  A  1  30.868  -2.883  36.961  1.00  0.14364205034979632
ATOM  5   CB   SER  A  1  30.854  -6.329  36.118  1.00  0.14364205034979632
ATOM  6   OG   SER  A  1  31.600  -7.531  36.190  1.00  0.14364205034979632
ATOM  7   N    THR  A  2  30.605  -3.796  34.906  1.00  0.50527753403393372
ATOM  8   CA   THR  A  2  29.920  -2.658  34.286  1.00  0.50527753403393372
ATOM  9   C    THR  A  2  28.542  -3.116  33.777  1.00  0.50527753403393372
ATOM  10  O    THR  A  2  27.815  -2.341  33.141  1.00  0.50527753403393372
ATOM  11  CB   THR  A  2  30.734  -2.067  33.086  1.00  0.50527753403393372
ATOM  12  OG1  THR  A  2  31.045  -3.101  32.139  1.00  0.50527753403393372
ATOM  13  CG2  THR  A  2  32.020  -1.403  33.566  1.00  0.50527753403393372

Update:

With bash's printf:

printf "%s %6.d  %-3s %s %s   %s      %s  %s  %s  %s %s\n" $(join -1 6 -2 1 file1 file2 -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,1.10,2.2)

Output:

ATOM      1  N   SER A   1      31.848  -5.217  38.114  1.00 0.14364205034979632
ATOM      2  CA  SER A   1      31.668  -5.130  36.630  1.00 0.14364205034979632
ATOM      3  C   SER A   1      30.991  -3.833  36.183  1.00 0.14364205034979632
ATOM      4  O   SER A   1      30.868  -2.883  36.961  1.00 0.14364205034979632
ATOM      5  CB  SER A   1      30.854  -6.329  36.118  1.00 0.14364205034979632
ATOM      6  OG  SER A   1      31.600  -7.531  36.190  1.00 0.14364205034979632
ATOM      7  N   THR A   2      30.605  -3.796  34.906  1.00 0.50527753403393372
ATOM      8  CA  THR A   2      29.920  -2.658  34.286  1.00 0.50527753403393372
ATOM      9  C   THR A   2      28.542  -3.116  33.777  1.00 0.50527753403393372
ATOM     10  O   THR A   2      27.815  -2.341  33.141  1.00 0.50527753403393372
ATOM     11  CB  THR A   2      30.734  -2.067  33.086  1.00 0.50527753403393372
ATOM     12  OG1 THR A   2      31.045  -3.101  32.139  1.00 0.50527753403393372
ATOM     13  CG2 THR A   2      32.020  -1.403  33.566  1.00 0.50527753403393372
Cyrus
1

This is an incredibly common problem; I'm surprised you couldn't find a solution:

$ awk 'NR==FNR{a[$1]=$2;next} {$11=a[$6]} 1' file2 file1
ATOM 1 N SER A 1 31.848 -5.217 38.114 1.00 0.14364205034979632
ATOM 2 CA SER A 1 31.668 -5.130 36.630 1.00 0.14364205034979632
ATOM 3 C SER A 1 30.991 -3.833 36.183 1.00 0.14364205034979632
ATOM 4 O SER A 1 30.868 -2.883 36.961 1.00 0.14364205034979632
ATOM 5 CB SER A 1 30.854 -6.329 36.118 1.00 0.14364205034979632
ATOM 6 OG SER A 1 31.600 -7.531 36.190 1.00 0.14364205034979632
ATOM 7 N THR A 2 30.605 -3.796 34.906 1.00 0.50527753403393372
ATOM 8 CA THR A 2 29.920 -2.658 34.286 1.00 0.50527753403393372
ATOM 9 C THR A 2 28.542 -3.116 33.777 1.00 0.50527753403393372
ATOM 10 O THR A 2 27.815 -2.341 33.141 1.00 0.50527753403393372
ATOM 11 CB THR A 2 30.734 -2.067 33.086 1.00 0.50527753403393372
ATOM 12 OG1 THR A 2 31.045 -3.101 32.139 1.00 0.50527753403393372
ATOM 13 CG2 THR A 2 32.020 -1.403 33.566 1.00 0.50527753403393372

If you care about preserving the white space:

$ awk 'NR==FNR{a[$1]=$2;next} {sub(/[^[:space:]]+[[:space:]]*$/,a[$6])} 1' file2 file1
ATOM      1  N   SER A   1      31.848  -5.217  38.114  1.00 0.14364205034979632
ATOM      2  CA  SER A   1      31.668  -5.130  36.630  1.00 0.14364205034979632
ATOM      3  C   SER A   1      30.991  -3.833  36.183  1.00 0.14364205034979632
ATOM      4  O   SER A   1      30.868  -2.883  36.961  1.00 0.14364205034979632
ATOM      5  CB  SER A   1      30.854  -6.329  36.118  1.00 0.14364205034979632
ATOM      6  OG  SER A   1      31.600  -7.531  36.190  1.00 0.14364205034979632
ATOM      7  N   THR A   2      30.605  -3.796  34.906  1.00 0.50527753403393372
ATOM      8  CA  THR A   2      29.920  -2.658  34.286  1.00 0.50527753403393372
ATOM      9  C   THR A   2      28.542  -3.116  33.777  1.00 0.50527753403393372
ATOM     10  O   THR A   2      27.815  -2.341  33.141  1.00 0.50527753403393372
ATOM     11  CB  THR A   2      30.734  -2.067  33.086  1.00 0.50527753403393372
ATOM     12  OG1 THR A   2      31.045  -3.101  32.139  1.00 0.50527753403393372
ATOM     13  CG2 THR A   2      32.020  -1.403  33.566  1.00 0.50527753403393372
Ed Morton
  • Thanks for this answer, Ed. :) However, when I tried your second suggestion, it only displays the original file (file1) unmodified. – ajthealchemist Jul 17 '16 at 16:08
  • @ajthealchemist You may be using an old, non-POSIX awk which wouldn't recognize the POSIX character class `[:space:]`. Try replacing `[^[:space:]]` with `[^ \t]`. If that doesn't work tell us what `awk --version` outputs and post the output of `cat -v file | tr ' ' '#' | head -1` so we can check for trailing white space and/or control characters. – Ed Morton Jul 17 '16 at 16:10
  • @ajthealchemist Actually, I see from your question that you're using a version of gawk new enough to support the 4th arg to split(), so it can't be the character class issue. Just tell us the awk version and the cat output. You must have some control characters or something in your input file, or you simply made a mistake when copy/pasting my script. Also try changing the sub() to allow for possible trailing whitespace: `sub(/[^[:space:]]+[[:space:]]*$/,a[$6])`. – Ed Morton Jul 17 '16 at 16:18
  • Thanks for that follow-up answer. I'm currently using GNU Awk 4.1.3. The modification of sub() to `sub(/[^[:space:]]+[[:space:]]*$/,a[$6])` worked perfectly. Will edit your answer. – ajthealchemist Jul 17 '16 at 17:34
  • Unfortunately Abhijeet I won't be near Gujarat any time soon but if you're ever in the Chicago area .... ;-). – Ed Morton Jul 17 '16 at 20:29
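
One case the one-liners in this answer leave open is a residue number in file1 that has no entry in file2: `a[$6]` is then empty and the last column is blanked. Below is a minimal sketch of a guarded variant, assuming that unmatched lines should pass through unchanged. That policy, the `GLY` line, and the temporary sample files are assumptions made for illustration, not part of the original answer. It uses the `[^ \t]` spelling suggested in the comments so that it does not depend on POSIX character classes:

```shell
#!/bin/sh
tmp=$(mktemp -d)
# Hypothetical sample data: residue 3 deliberately has no entry in file2.
cat > "$tmp/file1" <<'EOF'
ATOM      1  N   SER A   1      31.848  -5.217  38.114  1.00 39.55
ATOM      7  N   THR A   2      30.605  -3.796  34.906  1.00 39.92
ATOM     14  N   GLY A   3      27.100  -1.900  33.000  1.00 37.10
EOF
printf '1  0.14364205034979632\n2  0.50527753403393372\n' > "$tmp/file2"

guarded_output=$(awk '
NR == FNR { a[$1] = $2; next }                # file2: residue number -> new value
($6 in a) { sub(/[^ \t]+[ \t]*$/, a[$6]) }    # swap the last field only if a value exists
{ print }                                     # every line of file1 is printed
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$guarded_output"
rm -rf "$tmp"
```

Because `sub()` edits `$0` in place instead of assigning to a field, awk never rebuilds the record, which is what keeps the original column spacing.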
0

This solution is gawk-specific (see Defining Fields by Content). It assumes that the two columns of file2 are separated by a single space, which is what yields the required output spacing:

awk 'BEGIN {FPAT = "([[:space:]]*[[:alnum:][:punct:][:digit:]]+)"; OFS = "";} FNR==NR{a[$1]=$2; next} {$11=a[$6+0]} {print}' file2 file1 
  • {$11=a[$6+0]} is used so that values of $6 such as " 1" and " 2", which keep their leading blanks under FPAT, are converted to numbers and therefore match the keys "1" and "2" of array a instead of failing a string comparison (thanks to @Ed Morton for the explanation)
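
The effect of `+0` on the array lookup can be checked in isolation. This is a small sketch, not taken from the answer: the variable `x` is a hypothetical stand-in for a field that kept its leading blanks. Array subscripts in awk are strings, so `"   1"` and `"1"` are different keys until the value is forced to a number:

```shell
#!/bin/sh
# x stands in for an FPAT field that kept its leading blanks (hypothetical value).
key_demo=$(awk 'BEGIN {
    a["1"] = "hit"
    x = "   1"        # string with leading blanks: not the same key as "1"
    k = x + 0         # numeric 1, which converts to the subscript "1"
    printf "%d %d %s\n", (x in a), (k in a), a[k]
}')
printf '%s\n' "$key_demo"
```

The first number printed is 0 (the padded string is not a key), the second is 1 (the numeric value is), followed by the stored value.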

Sundeep
  • You're welcome. Don't do this using FPAT, though. It's the wrong approach: it's gawk-specific, kludgy, and limited in applicability, while there's a far better general solution if you're using gawk (4th arg to split()) and it's necessary (which it isn't in this case). The wrong answer is accepted in the question you reference. – Ed Morton Jul 17 '16 at 15:16
  • Thanks again. I had just searched and stitched together an answer. That was a learning experience, and there's more to learn from your comments and answer. – Sundeep Jul 17 '16 at 15:28
  • Thanks for your answer @spasic. However, when I tried your answer, column 11 of file1 was removed. Please check the answer I've accepted and maybe you could give us some comments regarding it. Cheers! – ajthealchemist Jul 17 '16 at 18:03
  • @ajthealchemist I checked again, including adding whitespace at the end of the lines in file1. It works with your sample input and produces the expected output. Can you check again? – Sundeep Jul 18 '16 at 03:58