I need to parse the output of a chemistry program run with different parameters and combine the information of interest in a specific format.

Each output file from the program looks like the following table. It gives the population of the protonated and unprotonated species of each residue at a particular pH (here pH=0):

   Residue Number     State  0     State  1     State  2     State  3     State  4
-----------------------------------------------------------------------------------
Residue: GL4 7    0.000410 (0) 0.453512 (1) 0.004275 (1) 0.535908 (1) 0.005895 (1)
Residue: HIP 15   0.900000 (2) 0.080000 (1) 0.020000 (1)
Residue: AS4 18   0.010085 (0) 0.486042 (1) 0.004335 (1) 0.495922 (1) 0.003615 (1)
Residue: GL4 35   0.000000 (0) 0.581343 (1) 0.000360 (1) 0.368002 (1) 0.050295 (1)
Residue: AS4 48   0.022640 (0) 0.520073 (1) 0.018440 (1) 0.425152 (1) 0.013695 (1)
Residue: AS4 52   0.038725 (0) 0.517533 (1) 0.113676 (1) 0.280601 (1) 0.049465 (1)
Residue: AS4 66   1.000000 (0) 0.000000 (1) 0.000000 (1) 0.000000 (1) 0.000000 (1)
Residue: AS4 87   0.004295 (0) 0.439747 (1) 0.010535 (1) 0.524678 (1) 0.020745 (1)
Residue: AS4 101  0.000105 (0) 0.504673 (1) 0.013110 (1) 0.478517 (1) 0.003595 (1)
Residue: AS4 119  0.014240 (0) 0.488767 (1) 0.007100 (1) 0.483272 (1) 0.006620 (1)

I have one file like this for each pH (all files have exactly the same residues and states; only the populations change). I would like to extract the deprotonated fraction for every residue. The deprotonated fraction corresponds to the populations that have a (0) after their number: for example, for GL4 7 at pH=0 it is 0.000410 (which corresponds to state 0), and for AS4 66 it is 1.000000. In fact it is state 0 for all residues EXCEPT HIP 15: there the deprotonated fraction is marked with (1) and corresponds to states 1 and 2 combined. In the example above it is 0.080000 + 0.020000 = 0.1.

I then need to combine this information from the different files into a single file that looks like this:

#     pH     GLU7    HIS15    ASP18    GLU35    ASP48    ASP52    ASP66    ASP87   ASP101   ASP119
   0.000    0.000    0.100    0.010    0.000    0.023    0.039    1.000    0.004    0.000    0.014
   1.000    0.006    0.140    0.098    0.000    0.276    0.312    1.000    0.015    0.002    0.069

Each column corresponds to a residue and each row to a pH (i.e. the information from a single file; here I show the information from just two files).

I tried to come up with an awk one-liner, but I am a beginner and I am not sure how to proceed. Actually, I don't know whether awk is the best tool for this job; perhaps sed and grep, or Python, would be better. I will need to do this kind of parsing several times with a number of different outputs (which all have the same layout, although the residues will change), so I would like an automated approach that still leaves some flexibility.

Any suggestions or comments are welcome; I would really appreciate help in sorting out this problem.

Many thanks in advance!

ejl62
  • `awk` is not a good solution, as it always works only with one file at a time and cannot combine files. I recommend using Python `pandas` `DataFrame`s. – DYZ Dec 03 '16 at 23:50
  • Why is the deprotonated fraction indicated by `(1)` for `HIP 15`? Is the general rule that the deprotonated fraction is the sum of those states with the _minimum_ number as the indicator? – mhawke Dec 04 '16 at 09:04
  • @mhawke, yes indeed, that is a very good point: the deprotonated fraction is the sum of those states with the minimum number as the indicator. Could this be used somehow to extract the information of interest? – ejl62 Dec 05 '16 at 13:30

3 Answers

0

You can concatenate all the files into a single file with a for loop, then use this earlier Stack Overflow solution to transpose the rows into columns:

An efficient way to transpose a file in Bash

Vincent K
  • Thanks but it is not a simple transposition, I need to extract specific bits from the original file – ejl62 Dec 05 '16 at 13:32
0

It's not completely clear what you want, but Python's `split` method may be of use to you. Called without any arguments, it splits on whitespace and treats a run of several spaces as a single separator.

So this line for example,

Residue: GL4 7    0.000410 (0) 0.453512 (1) 0.004275 (1) 0.535908 (1) 0.005895 (1)

can be split like this,

a = 'Residue: GL4 7    0.000410 (0) 0.453512 (1) 0.004275 (1) 0.535908 (1) 0.005895 (1)'
l = a.split()
print(l)

['Residue:', 'GL4', '7', '0.000410', '(0)', '0.453512', '(1)', '0.004275', '(1)', '0.535908', '(1)', '0.005895', '(1)']

You can then access the values you want and work on them. Calling float and int on the strings (e.g. float('0.000410')) converts them to numbers. For the '(1)', you can strip the parentheses first: int('(1)'[1:-1])
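Building on the split idea, here is a minimal sketch of how one line could be reduced to a deprotonated fraction. It relies on the rule confirmed in the comments under the question (sum the populations whose indicator is the smallest one on that line). The function name `deprotonated_fraction` is only illustrative, and the sketch assumes every line has the `Residue: NAME NUMBER pop (n) pop (n) ...` layout shown above.

```python
def deprotonated_fraction(line):
    """Return (residue_name, residue_number, fraction) for one 'Residue:' line."""
    fields = line.split()
    name, number = fields[1], int(fields[2])
    # From the fourth field on, the fields alternate: population, "(n)" indicator.
    populations = [float(x) for x in fields[3::2]]
    indicators = [int(x[1:-1]) for x in fields[4::2]]
    # Assumed rule: the deprotonated states carry the smallest indicator.
    lowest = min(indicators)
    fraction = sum(p for p, n in zip(populations, indicators) if n == lowest)
    return name, number, fraction


line = 'Residue: HIP 15   0.900000 (2) 0.080000 (1) 0.020000 (1)'
print(deprotonated_fraction(line))
```

For HIP 15 this adds the two populations tagged (1); for the other residues only the single (0) population is picked up, so no special case is needed.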

0

This awk script should get you started. In order to get the desired output, you will have to replace the filename with the corresponding pH value. And I omitted lines that contain no zero state, since you did not specify what to do with those.

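# Skip the column-header line and the dashed separator.
/^   Residue/ || /^-----/ { next; }

# For every residue line, remember the file and the residue,
# and store the population that is tagged "(0)".
{
    filenames[FILENAME] = 1;
    columns[$2 " " $3] = 1;
    # From $4 on, fields come in pairs: population, then "(n)" tag.
    for (i = 5; i <= NF; i = i + 2) {
        if ($i == "(0)") {
            data[$2 " " $3, FILENAME] = $(i-1);
        }
    }
}

# Print a header row, then one row per input file.
# Note: "for (key in array)" visits keys in an unspecified order.
END {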
    printf("%10s", "filename");
    for (col in columns) {
        printf("%10s", col);
    }
    print "";
    for (filename in filenames) {
        printf("%10s", filename);
        for (col in columns) {
            printf("%10s", data[col, filename]);
        }
        print "";
    }
}
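
For comparison, here is a rough Python sketch of the same gather-then-print idea that also covers the rows without a (0) state and keeps the residues in file order. It is only a sketch under stated assumptions: the `LABELS` mapping is inferred from the example header in the question (GL4 to GLU, AS4 to ASP, HIP to HIS), the function names are made up for illustration, and the pH of each run has to be supplied by the caller because the file naming scheme is not given.

```python
# Assumed mapping from the program's residue codes to the column labels
# used in the desired output (inferred from the example header).
LABELS = {'GL4': 'GLU', 'AS4': 'ASP', 'HIP': 'HIS'}


def parse_fractions(lines):
    """Return a list of (label, deprotonated_fraction) pairs in file order."""
    result = []
    for line in lines:
        if not line.startswith('Residue:'):
            continue  # skips the header, the dashed separator and blank lines
        fields = line.split()
        label = LABELS.get(fields[1], fields[1]) + fields[2]
        pops = [float(x) for x in fields[3::2]]
        tags = [int(x[1:-1]) for x in fields[4::2]]
        lowest = min(tags)  # assumed rule: smallest indicator = deprotonated
        result.append((label, sum(p for p, t in zip(pops, tags) if t == lowest)))
    return result


def format_table(runs):
    """runs: non-empty list of (pH, lines) pairs; returns the table as text."""
    rows = [(ph, parse_fractions(lines))
            for ph, lines in sorted(runs, key=lambda run: run[0])]
    header = '#' + '%7s' % 'pH' + ''.join('%9s' % label for label, _ in rows[0][1])
    out = [header]
    for ph, pairs in rows:
        out.append('%8.3f' % ph + ''.join('%9.3f' % frac for _, frac in pairs))
    return '\n'.join(out)


ph0 = """   Residue Number     State  0     State  1     State  2
Residue: GL4 7    0.000410 (0) 0.453512 (1) 0.004275 (1) 0.535908 (1) 0.005895 (1)
Residue: HIP 15   0.900000 (2) 0.080000 (1) 0.020000 (1)""".splitlines()
print(format_table([(0.0, ph0)]))
# With real files, something like [(0.0, open('pH0.out')), ...] could be
# passed instead; those file names are hypothetical.
```

Sorting the runs by pH gives a stable row order, which the awk `for (key in array)` loops above do not guarantee.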
Michael Vehrs