I need to parse the output of a chemistry program run with different parameters and combine the information of interest in a specific format.
Each output file from the program looks like the following table. It gives the population of the protonated and deprotonated states of each residue at a particular pH (here pH=0):
Residue       Number  State 0       State 1       State 2       State 3       State 4
-----------------------------------------------------------------------------------
Residue: GL4  7       0.000410 (0)  0.453512 (1)  0.004275 (1)  0.535908 (1)  0.005895 (1)
Residue: HIP  15      0.900000 (2)  0.080000 (1)  0.020000 (1)
Residue: AS4  18      0.010085 (0)  0.486042 (1)  0.004335 (1)  0.495922 (1)  0.003615 (1)
Residue: GL4  35      0.000000 (0)  0.581343 (1)  0.000360 (1)  0.368002 (1)  0.050295 (1)
Residue: AS4  48      0.022640 (0)  0.520073 (1)  0.018440 (1)  0.425152 (1)  0.013695 (1)
Residue: AS4  52      0.038725 (0)  0.517533 (1)  0.113676 (1)  0.280601 (1)  0.049465 (1)
Residue: AS4  66      1.000000 (0)  0.000000 (1)  0.000000 (1)  0.000000 (1)  0.000000 (1)
Residue: AS4  87      0.004295 (0)  0.439747 (1)  0.010535 (1)  0.524678 (1)  0.020745 (1)
Residue: AS4  101     0.000105 (0)  0.504673 (1)  0.013110 (1)  0.478517 (1)  0.003595 (1)
Residue: AS4  119     0.014240 (0)  0.488767 (1)  0.007100 (1)  0.483272 (1)  0.006620 (1)
I have one file like this for each pH (all files have exactly the same residues and states; only the populations change). I would like to extract the deprotonated fraction for every residue. The deprotonated fraction is the sum of the populations that are followed by (0): for example, for GL4 7 at pH=0 it is 0.000410 (which corresponds to state 0), and for AS4 66 it is 1.000000. In fact it is state 0 for all residues EXCEPT HIP 15: in that case the deprotonated states are marked with (1) and correspond to states 1 and 2. In the example above, that gives 0.080000 + 0.020000 = 0.1.
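To make the rule concrete, here is a minimal Python sketch of it for a single line of the table. The names `DEPROT_MARKER`, `STATE_RE` and `deprotonated_fraction` are hypothetical, and the sketch assumes that every data line starts with `Residue:` and that HIP is the only residue whose deprotonated states are marked with (1); other residue types would have to be added to the dictionary.

```python
import re

# Marker (the number in parentheses) that identifies deprotonated states.
# Assumption: 0 for every residue except HIP, which uses 1.
DEPROT_MARKER = {"HIP": 1}

# One "population (marker)" pair, e.g. "0.000410 (0)".
STATE_RE = re.compile(r"(\d+\.\d+)\s+\((\d+)\)")


def deprotonated_fraction(line):
    """Return (name, number, fraction) for a 'Residue:' line, else None."""
    fields = line.split()
    if len(fields) < 3 or fields[0] != "Residue:":
        return None  # header, separator or blank line
    name, number = fields[1], int(fields[2])
    marker = DEPROT_MARKER.get(name, 0)
    # Sum every population whose marker matches the deprotonated marker.
    total = sum(float(pop) for pop, mark in STATE_RE.findall(line)
                if int(mark) == marker)
    return name, number, total
```

Calling it on the HIP 15 line above should give a fraction of about 0.1, and on the GL4 7 line a fraction of 0.00041; the header and separator lines return `None`.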
I then need to combine this information from the different files into a single file which looks like this:
# pH GLU7 HIS15 ASP18 GLU35 ASP48 ASP52 ASP66 ASP87 ASP101 ASP119
0.000 0.000 0.100 0.010 0.000 0.023 0.039 1.000 0.004 0.000 0.014
1.000 0.006 0.140 0.098 0.000 0.276 0.312 1.000 0.015 0.002 0.069
Each column corresponds to a residue, and each row to a pH (i.e. the information from a single file; here I show the information from just two files).
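As an illustration of that layout, here is a rough Python sketch that builds the combined table from fractions that have already been extracted. The names `DISPLAY_NAME` and `format_table` are hypothetical. It assumes the caller supplies the pH of each file (for example from the file name, since the pH does not appear in the table itself) and that the residue labels follow the mapping implied by the header above (GL4 to GLU, HIP to HIS, AS4 to ASP).

```python
# Display names inferred from the desired header; extend as needed.
DISPLAY_NAME = {"GL4": "GLU", "HIP": "HIS", "AS4": "ASP"}


def format_table(rows):
    """Build the combined table as a string.

    rows maps each pH (float) to a list of (name, number, fraction)
    tuples, one per residue, in the same order for every pH.
    """
    lines = []
    for i, ph in enumerate(sorted(rows)):
        residues = rows[ph]
        if i == 0:
            # Header is built once, from the residues of the lowest pH.
            labels = [f"{DISPLAY_NAME.get(name, name)}{num}"
                      for name, num, _ in residues]
            lines.append("# pH " + " ".join(labels))
        values = [f"{ph:.3f}"] + [f"{frac:.3f}" for _, _, frac in residues]
        lines.append(" ".join(values))
    return "\n".join(lines)
```

The rows are sorted by pH, so the files can be read in any order. Residue names missing from `DISPLAY_NAME` are written unchanged, which leaves some flexibility for outputs with other residue types.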
I tried to come up with an awk one-liner, but I am a beginner and I am not sure how to proceed. Actually, I don't know whether awk is the best tool for this job; perhaps sed and grep, or Python, would be better. I will need to do this kind of parsing several times with a number of different outputs (they all have the same layout, although the residues will change), so I would like an automated approach that keeps some flexibility.
Any suggestions or comments are welcome; I would really appreciate any help in solving this problem.
Many thanks in advance!