Sorting biomolecules according to their energy

Question

I have a file that contains 7000 molecules along with their names and energies. Each molecule starts with the keyword MODEL 1; the second line holds the energy (-9.102 for the first molecule in the example below) and the 7th line holds the molecule's name (S3670 Cefsulodin (sodium).cdx for the first molecule). I want to rank/sort all molecules by energy so that the lowest (most negative) comes first in a resulting text file, along with each molecule's name. The energy and name can be on the same line or on different lines in the output. I thought of using grep for the parsing, but I have no experience sorting by a value embedded in a line of text. Can somebody please help? Thank you.

MODEL 1
REMARK VINA RESULT:    -9.102      0.000      0.000
REMARK INTER + INTRA:         -13.194
REMARK INTER:                 -12.767
REMARK INTRA:                  -0.427
REMARK UNBOUND:                 0.165
REMARK  Name = S3670 Cefsulodin (sodium).cdx
REMARK  8 active torsions:
REMARK  status: ('A' for Active; 'I' for Inactive)
REMARK    1  A    between atoms: CA_3  and  C_8
REMARK    2  A    between atoms: CA_5  and  N_10
REMARK    3  A    between atoms: C_7  and  C_12
REMARK    4  A    between atoms: C_12  and  N_16
REMARK    5  A    between atoms: C_15  and  C_17
REMARK    6  A    between atoms: C_17  and  C_21
REMARK    7  A    between atoms: C_17  and  S_22
REMARK    8  A    between atoms: C_30  and  C_33
REMARK                            x       y       z     vdW  Elec       q    Type
REMARK                         _______ _______ _______ _____ _____    ______ ____
ROOT
ATOM      1  N   UNL     1      92.970 106.706  73.996  0.00  0.00    +0.000 N
ATOM      2  C   UNL     1      93.751 107.062  75.160  0.00  0.00    +0.000 C
MODEL 1
REMARK VINA RESULT:    -6.812      0.000      0.000
REMARK INTER + INTRA:         -12.561
REMARK INTER:                 -11.387
REMARK INTRA:                  -1.175
REMARK UNBOUND:                -1.767
REMARK  Name = S3836 6-Gingerol.cdx
REMARK  10 active torsions:
REMARK  status: ('A' for Active; 'I' for Inactive)
REMARK    1  A    between atoms: C_1  and  C_2
REMARK    2  A    between atoms: C_1  and  C_12
REMARK    3  A    between atoms: C_2  and  C_3
REMARK    4  A    between atoms: C_3  and  C_4
REMARK    5  A    between atoms: C_4  and  C_5
REMARK    6  A    between atoms: C_5  and  C_6
REMARK    7  A    between atoms: C_6  and  C_7
REMARK    8  A    between atoms: C_7  and  C_8
REMARK    9  A    between atoms: C_8  and  C_9
REMARK   10  A    between atoms: C_14  and  O_18
REMARK                            x       y       z     vdW  Elec       q    Type
REMARK                         _______ _______ _______ _____ _____    ______ ____
ROOT
ATOM      1  C   UNL     1      89.880 102.122  75.634  0.00  0.00    +0.000 C
ENDROOT

Please provide the expected result for the sample shown and what you have tried so far (including any research). Even if you are unfamiliar with sorting, you could start by parsing the data and post your interim code. BR. - tshiono, Oct 21 '22 at 02:49
A *Decorate, Sort, Undecorate* approach should work fine, with `awk` doing the decorating, piping its output to `sort`, and then piping the sorted output to either `awk` again or `sed` to undecorate. You should be able to search `"decorate sort undecorate"` above and come up with a few examples. Essentially, you combine the wanted lines for each molecule into a single line, with some token representing the line breaks, and decorate it with the energy to sort by as a new 1st column. You sort on the 1st column, then undecorate by removing the 1st column and splitting the lines on your token. - David C. Rankin, Oct 21 '22 at 03:09

score 1 · Accepted Answer · answered Oct 21 '22 at 03:06

You would have to do the following steps:

1. Split the file at "MODEL 1", so that you get one model per file.
2. Write a function or script that takes in a single model's data and prints out a sortable string like ENERGY MODEL_NAME.
3. Apply the function from step 2 to every file, to obtain a result where every line is a molecule's energy and name.
4. Sort this.

All of these are thoroughly explained elsewhere, but here are some common Unix commands you can use:

1. grep, sed, awk and many others can do this: "How do I split a string on a delimiter in Bash?"
2. head/tail/cut or sed: "Getting n-th line of text output" and "Bash + sed/awk/cut to delete nth character"
3. xargs or parallel: https://savannah.gnu.org/projects/parallel/
4. sort: https://en.wikipedia.org/wiki/Sort_(Unix)

That said, this is an excellent example of a task that is extremely painful to do in Bash, but a breeze in a better language like Python. Just read the file from STDIN, split it into models with .split("MODEL 1"), call .splitlines() on each resulting chunk, and use elementary Python string/list slicing/indexing to pull out your data. You can sort in Python too.
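
A minimal sketch of that Python approach, assuming the file looks like the sample in the question. The function name `rank_models`, the two regular expressions, and the abbreviated `SAMPLE` text are illustrative choices rather than anything from the original post. It matches the `REMARK VINA RESULT:` and `REMARK  Name =` lines by pattern instead of by fixed line position, which should be a little more forgiving if the layout varies.

```python
import re

# Patterns are assumptions based on the sample shown in the question.
MODEL_RE = re.compile(r"^MODEL 1[ \t]*$", re.MULTILINE)
ENERGY_RE = re.compile(r"^REMARK VINA RESULT:\s+(-?\d+(?:\.\d+)?)", re.MULTILINE)
NAME_RE = re.compile(r"^REMARK\s+Name = (.*)$", re.MULTILINE)


def rank_models(text):
    """Return (energy, name) pairs sorted from lowest to highest energy."""
    results = []
    # Everything before the first "MODEL 1" line is not a model, so skip it.
    for block in MODEL_RE.split(text)[1:]:
        energy = ENERGY_RE.search(block)
        name = NAME_RE.search(block)
        if energy and name:  # silently skip blocks missing either field
            results.append((float(energy.group(1)), name.group(1).strip()))
    # Tuples sort by their first element, so the most negative energy comes first.
    return sorted(results)


# Abbreviated copy of the two molecules from the question, in reverse order.
SAMPLE = """\
MODEL 1
REMARK VINA RESULT:    -6.812      0.000      0.000
REMARK INTER + INTRA:         -12.561
REMARK  Name = S3836 6-Gingerol.cdx
ROOT
ENDROOT
MODEL 1
REMARK VINA RESULT:    -9.102      0.000      0.000
REMARK INTER + INTRA:         -13.194
REMARK  Name = S3670 Cefsulodin (sodium).cdx
ROOT
"""

for energy, name in rank_models(SAMPLE):
    print(f"{energy:.3f}\t{name}")
```

To run this on the real file, replace `SAMPLE` with the file's contents, for example `open(path).read()` or `sys.stdin.read()`, and redirect the printed lines to a text file.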
