awk: log processing based on multiple patterns

Question

I am working with log files containing measurements taken from different samples (identified by numbers such as 1.1, 1.2 ... 1.14), arranged in the following format:

Finding intramodel H-bonds
Constraints relaxed by 0.5 angstroms and 20 degrees
Models used:
    1.1 SarsCov2_structure19R_nsp5holo_rep1.pdb
    1.2 SarsCov2_structure19R_nsp5holo_rep1.pdb
    1.3 SarsCov2_structure19R_nsp5holo_rep1.pdb
    1.4 SarsCov2_structure19R_nsp5holo_rep1.pdb
    1.5 SarsCov2_structure19R_nsp5holo_rep1.pdb
    1.6 SarsCov2_structure19R_nsp5holo_rep1.pdb
    1.7 SarsCov2_structure19R_nsp5holo_rep1.pdb
    1.8 SarsCov2_structure19R_nsp5holo_rep1.pdb
    1.9 SarsCov2_structure19R_nsp5holo_rep1.pdb
    1.10 SarsCov2_structure19R_nsp5holo_rep1.pdb
    1.11 SarsCov2_structure19R_nsp5holo_rep1.pdb
    1.12 SarsCov2_structure19R_nsp5holo_rep1.pdb
    1.13 SarsCov2_structure19R_nsp5holo_rep1.pdb
    1.14 SarsCov2_structure19R_nsp5holo_rep1.pdb

16 H-bonds
H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):
SarsCov2_structure19R_nsp5holo_rep1.pdb #1.1/? HIS 163 NE2   SarsCov2_structure19R_nsp5holo_rep1.pdb #1.1/A UNL 888 S   no hydrogen                                                   3.850  N/A
SarsCov2_structure19R_nsp5holo_rep1.pdb #1.1/? GLU 166 N     SarsCov2_structure19R_nsp5holo_rep1.pdb #1.1/A UNL 888 O   SarsCov2_structure19R_nsp5holo_rep1.pdb #1.1/? GLU 166 H      2.909  2.070
SarsCov2_structure19R_nsp5holo_rep1.pdb #1.1/A UNL 888 N     SarsCov2_structure19R_nsp5holo_rep1.pdb #1.1/? CYS 44 O    SarsCov2_structure19R_nsp5holo_rep1.pdb #1.1/A UNL 888 H      2.798  1.892
SarsCov2_structure19R_nsp5holo_rep1.pdb #1.2/? GLN 189 NE2   SarsCov2_structure19R_nsp5holo_rep1.pdb #1.2/A UNL 888 S   SarsCov2_structure19R_nsp5holo_rep1.pdb #1.2/? GLN 189 1HE2   3.896  2.916
SarsCov2_structure19R_nsp5holo_rep1.pdb #1.3/? GLU 166 N     SarsCov2_structure19R_nsp5holo_rep1.pdb #1.3/A UNL 888 O   SarsCov2_structure19R_nsp5holo_rep1.pdb #1.3/? GLU 166 H      2.673  1.892
SarsCov2_structure19R_nsp5holo_rep1.pdb #1.3/A UNL 888 N     SarsCov2_structure19R_nsp5holo_rep1.pdb #1.3/? CYS 44 O    SarsCov2_structure19R_nsp5holo_rep1.pdb #1.3/A UNL 888 H      3.071  2.338
SarsCov2_structure19R_nsp5holo_rep1.pdb #1.4/? HIS 163 NE2   SarsCov2_structure19R_nsp5holo_rep1.pdb #1.4/A UNL 888 S   no hydrogen                                                   3.927  N/A
SarsCov2_structure19R_nsp5holo_rep1.pdb #1.4/A UNL 888 N     SarsCov2_structure19R_nsp5holo_rep1.pdb #1.4/? THR 190 O   SarsCov2_structure19R_nsp5holo_rep1.pdb #1.4/A UNL 888 H      3.029  2.173
SarsCov2_structure19R_nsp5holo_rep1.pdb #1.8/? GLN 189 NE2   SarsCov2_structure19R_nsp5holo_rep1.pdb #1.8/A UNL 888 S   SarsCov2_structure19R_nsp5holo_rep1.pdb #1.8/? GLN 189 2HE2   3.631  2.751
SarsCov2_structure19R_nsp5holo_rep1.pdb #1.9/? CYS 145 N     SarsCov2_structure19R_nsp5holo_rep1.pdb #1.9/A UNL 888 O   SarsCov2_structure19R_nsp5holo_rep1.pdb #1.9/? CYS 145 H      2.966  2.210
SarsCov2_structure19R_nsp5holo_rep1.pdb #1.9/A UNL 888 N     SarsCov2_structure19R_nsp5holo_rep1.pdb #1.9/? ARG 188 O   SarsCov2_structure19R_nsp5holo_rep1.pdb #1.9/A UNL 888 H      3.067  2.307
SarsCov2_structure19R_nsp5holo_rep1.pdb #1.10/? GLN 189 NE2  SarsCov2_structure19R_nsp5holo_rep1.pdb #1.10/A UNL 888 S  SarsCov2_structure19R_nsp5holo_rep1.pdb #1.10/? GLN 189 2HE2  3.693  2.786
SarsCov2_structure19R_nsp5holo_rep1.pdb #1.11/A UNL 888 N    SarsCov2_structure19R_nsp5holo_rep1.pdb #1.11/? THR 190 O  SarsCov2_structure19R_nsp5holo_rep1.pdb #1.11/A UNL 888 H     3.159  2.268
SarsCov2_structure19R_nsp5holo_rep1.pdb #1.12/? GLU 166 N    SarsCov2_structure19R_nsp5holo_rep1.pdb #1.12/A UNL 888 O  SarsCov2_structure19R_nsp5holo_rep1.pdb #1.12/? GLU 166 H     2.648  1.817
SarsCov2_structure19R_nsp5holo_rep1.pdb #1.13/A UNL 888 N    SarsCov2_structure19R_nsp5holo_rep1.pdb #1.13/? THR 190 O  SarsCov2_structure19R_nsp5holo_rep1.pdb #1.13/A UNL 888 H     3.176  2.395
SarsCov2_structure19R_nsp5holo_rep1.pdb #1.14/A UNL 888 N    SarsCov2_structure19R_nsp5holo_rep1.pdb #1.14/? PHE 140 O  SarsCov2_structure19R_nsp5holo_rep1.pdb #1.14/A UNL 888 H     2.833  1.955

I need to print the number associated with the sample (1-14) that corresponds to the first occurrence of two patterns, "GLU 166 N" and "CYS 44 O", with no other patterns within the same sample. The number to print is the one that appears on the same line just before the pattern, in the form #1.number/?. In this example the detected number should be 3 (since the associated ID is #1.3/?), where both patterns (and no others!) are found. Finally, if no sample contains both patterns, I would like to print the number of the sample with the first occurrence of the pattern "GLU 166 N" (as in my example).

Presently my AWK solution handles only a single pattern: it looks for the first occurrence of "GLU 166 N" (if the pattern cannot be found, the script prints 1). Basically, it looks for the pattern anywhere on the line and then prints the second part of the number (after the dot) from the 2nd field:


awk -vn=1 '/GLU 166 N/ {gsub(/.*\.|\/\?/,"",$2); n=$2; exit} END {print n}' input.log

Would you please elaborate about the relationship of the `two patterns`? As the patterns "GLU 166 N" and "CYS 44 O" appear on different lines, I don't see how they are associated with the number `3`. Besides, I'm afraid I cannot understand the meaning of `no other patterns within the same sample`. Or do you want to extract the number which is common with the two patterns? — tshiono, Apr 04 '22 at 23:22
yes sure. The two patterns may indeed be found only on different lines. They may belong to the same sample (ID), defined in the log as #1.1, #1.2, #1.3 ... #1.14. The goal is to print the number (after the dot) of the ID where both patterns are found. In this example it corresponds to the ID #1.3, which has only these two patterns, so we need to print 3. For example, #1.1 also has both search patterns, but there is also "HIS 163 NE2", which should exclude 1 from the results. – James Starlight, Apr 05 '22 at 09:09
Thank you for the response. I suppose I'm gradually understanding. One more question on your update. You mention `where the both patterns (and no others!) could be found` but the number `3` is also included in other pattern: `GLU 166 H`. Am I still misunderstanding? — tshiono, Apr 05 '22 at 11:14
Another question based on the analysis. It looks like `#1.1/? GLU 166 N` and `#1.1/? CYS 44 O` appear first. Why don't we pick `1` as the answer instead of `3`? – tshiono, Apr 05 '22 at 11:33
**EDIT** This may be because the posted `input.log` is the older one. If I test with the file posted in your previous question, `3` will be the correct answer. – tshiono, Apr 05 '22 at 11:55

tshiono · Accepted Answer · 2022-04-06T12:09:31.807

1

Based on our discussion, would you please try:

awk -F# '                               # split each line on "#" into fields
{
    for (i = 1; i <= NF; i++) {         # loop over the fields
        if (match($i, /^1\.[0-9]+\/\? GLU 166 N/)) {
            sub(/^1\./, "", $i); sub(/\/.*/, "", $i)
                                        # extract the number after "1." in $i
            if (first == "") first = $i # keep the first found value as a fallback
            if ($i in b) {              # if the number exists also in b
                queue[++qn] = $i        # then push it in the queue
            }
            a[$i]
            next
        } else if (match($i, /^1\.[0-9]+\/\? CYS 44 O/)) {
            sub(/^1\./, "", $i); sub(/\/.*/, "", $i)
            if ($i in a) {
                queue[++qn] = $i
            }
            b[$i]
            next
        } else if (match($i, /^1\.[0-9]+\/\? [A-Z]{3} [0-9]+ [A-Z][A-Z0-9]*/)) {
                                        # analyse other patterns
            sub(/^1\./, "", $i); sub(/\/.*/, "", $i)
            exclude[$i]
        }
    }
}
END {
    for (i = 1; i <= qn; i++) {         # examine the queue in appearance order
        j = queue[i]                    # j is the matched number
        if (! (j in exclude)) {         # if not found in other patterns
            print j                     # then it is the answer
            exit
        }
    }
    if (first == "") print "1"          # the default value
    else print first                    # the fallback
}' input.log

It searches for the names GLU 166 N and CYS 44 O, and for any other residue/atom names, together with the associated numbers embedded in the leading form #1.<num>/?.
If both GLU 166 N and CYS 44 O have the same number, the number is pushed onto queue in order of appearance.
We need to eliminate numbers which also appear with other names (except when the other name appears on the same line after either of the two search patterns). The array exclude records the numbers associated with these other names.
In the END block we examine the numbers in queue in order. The first number in queue that is not in exclude is used as the answer.
If GLU 166 N and CYS 44 O never share a number, the first number found with GLU 166 N is used as a fallback.
As a last resort, 1 is printed if GLU 166 N is not found at all.
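The number extraction that these steps rely on can be looked at in isolation. The following is a minimal sketch, assuming a single line from the log with the file name shortened to model.pdb (an illustrative stand-in, not a name from the real log). It splits the line on "#" and applies the same two sub() calls as the script, working on a copy of each field.

```shell
line='model.pdb #1.3/? GLU 166 N     model.pdb #1.3/A UNL 888 O   model.pdb #1.3/? GLU 166 H      2.673  1.892'

fields=$(printf '%s\n' "$line" | awk -F'#' '
{
    for (i = 2; i <= NF; i++) {          # $1 is only the text before the first "#"
        id = $i                          # work on a copy so that $i stays intact
        sub(/^1\./, "", id)              # drop the leading "1."
        sub(/\/.*/, "", id)              # drop everything from the first "/" onward
        rest = $i
        sub(/^1\.[0-9]+\//, "", rest)    # what follows "1.<num>/"
        printf "field %d: id=%s marker=%s\n", i, id, substr(rest, 1, 1)
    }
}')
printf '%s\n' "$fields"
```

All three fields yield the id 3, but only fields 2 and 4 carry the "?" marker that the regexes in the script require; field 3 (marker A, the UNL ligand) is never matched. Because the script calls next as soon as GLU 166 N matches in field 2, the GLU 166 H in field 4 is not examined and therefore does not land in exclude.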

[EDIT]
Here is a one-liner that uses bash variables as the search patterns and assigns the output to a bash variable var:

search_pattern1='GLU 166 N'
search_pattern2='CYS 44 O'

var=$(awk -F# -v pat1="$search_pattern1" -v pat2="$search_pattern2" '{for (i = 1; i <= NF; i++) {if (match($i, "^1\\.[0-9]+\\/\\? "pat1)) {sub(/^1\./, "", $i); sub(/\/.*/, "", $i); if (first == "") first = $i; if ($i in b) {queue[++qn] = $i} a[$i]; next} else if (match($i, "^1\\.[0-9]+\\/\\? "pat2)) {sub(/^1\./, "", $i); sub(/\/.*/, "", $i); if ($i in a) {queue[++qn] = $i} b[$i]; next} else if (match($i, /^1\.[0-9]+\/\? [A-Z]{3} [0-9]+ [A-Z][A-Z0-9]*/)) { sub(/^1\./, "", $i); sub(/\/.*/, "", $i); exclude[$i]}}} END {for (i = 1; i <= qn; i++) {j = queue[i]; if (! (j in exclude)) {print j; exit}} if (first == "") print "1"; else print first}' input.log)
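For readers who find the one-liner hard to follow, here is a multi-line sketch with the same intent, wrapped in `$(...)` across several lines. It is not the exact code above: the helper function id_of is a name introduced here for illustration, the interval `{3}` is spelled out as `[A-Z][A-Z][A-Z]` for awk versions without interval support, and the input is a hypothetical five-line log reduced from the question (file names shortened to m.pdb) instead of input.log.

```shell
search_pattern1='GLU 166 N'
search_pattern2='CYS 44 O'

# Hypothetical reduced log: sample 1 has an extra HIS 163 NE2, sample 3 does not.
sample_log=$(cat <<'EOF'
m.pdb #1.1/? HIS 163 NE2   m.pdb #1.1/A UNL 888 S   no hydrogen   3.850  N/A
m.pdb #1.1/? GLU 166 N     m.pdb #1.1/A UNL 888 O   m.pdb #1.1/? GLU 166 H   2.909  2.070
m.pdb #1.1/A UNL 888 N     m.pdb #1.1/? CYS 44 O    m.pdb #1.1/A UNL 888 H   2.798  1.892
m.pdb #1.3/? GLU 166 N     m.pdb #1.3/A UNL 888 O   m.pdb #1.3/? GLU 166 H   2.673  1.892
m.pdb #1.3/A UNL 888 N     m.pdb #1.3/? CYS 44 O    m.pdb #1.3/A UNL 888 H   3.071  2.338
EOF
)

var=$(printf '%s\n' "$sample_log" | awk -F'#' -v pat1="$search_pattern1" -v pat2="$search_pattern2" '
function id_of(s) {                      # "1.3/? GLU ..." -> "3" (s is a copy)
    sub(/^1\./, "", s); sub(/\/.*/, "", s)
    return s
}
{
    for (i = 1; i <= NF; i++) {
        if ($i ~ ("^1\\.[0-9]+\\/\\? " pat1)) {
            n = id_of($i)
            if (first == "") first = n   # fallback: first id seen with pat1
            if (n in b) queue[++qn] = n  # pat2 was already seen for this id
            a[n]
            next                         # ignore the rest of this line
        } else if ($i ~ ("^1\\.[0-9]+\\/\\? " pat2)) {
            n = id_of($i)
            if (n in a) queue[++qn] = n  # pat1 was already seen for this id
            b[n]
            next
        } else if ($i ~ /^1\.[0-9]+\/\? [A-Z][A-Z][A-Z] [0-9]+ [A-Z][A-Z0-9]*/) {
            exclude[id_of($i)]           # id also occurs with some other name
        }
    }
}
END {
    for (i = 1; i <= qn; i++)
        if (!(queue[i] in exclude)) { print queue[i]; exit }
    print (first == "" ? "1" : first)    # fallback, then default
}')
echo "$var"
```

On this reduced input both samples end up in queue, sample 1 is rejected because of HIS 163 NE2, and var is set to 3. To run it on a real file, drop the printf pipeline and pass input.log as the last argument to awk.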

[Explanations]
When using a variable as the regex pattern in awk, we need to take care of the quoting. In many cases we write the statement as:

if (match($0, /regex/)) ...

where the slashes act as quotes enclosing the regex pattern. Bare words between the slashes are treated as literal text, not as variable names. That is why we cannot put the variable name between the slashes. For instance, if we say:

if (match($i, /^1\.[0-9]+\/\? pat1/)) ...

the word pat1 is no longer the variable name. It's just a literal string.

How can we solve it? We need to put the quoted string and the variable side by side so that awk concatenates them into a single pattern:

if (match($i, "^1\\.[0-9]+\\/\\? "pat1)) ...

In this case we cannot use the slash quotes; we need double quotes instead.
Inside double quotes, each backslash must itself be escaped with another backslash.
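The difference between the two forms can be checked directly. This is a small sketch, assuming one field as it would look after splitting on "#" (the file name model.pdb is an illustrative stand-in); the variables literal and dynamic are names introduced here.

```shell
pat1='GLU 166 N'
field='1.3/? GLU 166 N     model.pdb '

# Regex literal: "pat1" between the slashes is just the four characters p, a, t, 1.
literal=$(awk -v pat1="$pat1" -v f="$field" 'BEGIN { print (f ~ /^1\.[0-9]+\/\? pat1/) }')

# String concatenation: the value of pat1 becomes part of the dynamic regex.
dynamic=$(awk -v pat1="$pat1" -v f="$field" 'BEGIN { print (f ~ ("^1\\.[0-9]+\\/\\? " pat1)) }')

echo "literal=$literal dynamic=$dynamic"
```

The literal form fails to match (0) and the concatenated form matches (1). Two caveats apply to the dynamic form: the value of pat1 is interpreted as a regex, so characters such as "." or "+" in a search pattern keep their special meaning, and awk processes backslash escape sequences in values passed with -v.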

BTW your post "How can I use bash variables in my Awk script?" has unfortunately been closed as a duplicate. However, the essential problem is a different one. I'm afraid the reviewers did not understand what you wanted to do.

edited Apr 06 '22 at 12:09

answered Apr 05 '22 at 12:02

tshiono


just a question, in order to test it I would like to put your awk code into my bash workflow and assign the output number to a variable like var=$(awk -F# ' ... ' input.log) ... I see that the script works, but the part of the code after the AWK changes colour in my editor. Should I probably add something more? – James Starlight Apr 05 '22 at 13:38
.. before I did it with one-liner syntax var=$(awk -vn=${default_state} '/GLU 166 N/ {gsub(/.*\.|\/\?/,"",$2); n=$2; exit} END {print n}' input.log) – James Starlight Apr 05 '22 at 13:40
You can wrap the whole code with parentheses in the same manner such as `var=$(awk ... input.log)` across the lines. – tshiono Apr 05 '22 at 13:45
well, maybe I am doing something wrong... it works OK but I have a feeling that the format is not good. Should I transform the code into one-line format? If so, could you please add to the answer a version where the awk output is assigned to a bash variable? – James Starlight Apr 05 '22 at 13:52
Sure, I've updated my answer with the variable-assignment version. Hope it will work. My concern is the one-liner is almost unreadable and unmaintainable. – tshiono Apr 05 '22 at 14:05
well, I've just tested it (mostly using the one-line solution since it is easy to incorporate into a bash script to test with many log files) and found that it works like my version: it correctly detects the first pattern "GLU 166 N" but never takes into account the second one although it is mentioned in the script, so we need a bug fix :-) – James Starlight Apr 05 '22 at 15:08
sorry, I should add that the script should print the ID where both patterns are found and no other patterns are present. This is why in the above example the number 3 should be printed instead of 1, since in 1.1 there is additionally the pattern "HIS 163 NE2" and in 1.3 there are only the two search patterns. I am sorry if it was not so clear in my first message; I am going to edit it – James Starlight Apr 05 '22 at 15:28
It is related to my previous question in the comment: `but the number 3 is also included in other pattern: GLU 166 H`. As for the `1.3` there still is another pattern `GLU 166 H` besides "GLU 166 N" and "CYS 44 O". Or should I skip `GLU 166 H` because it appears in the same line as `GLU 166 N`? – tshiono Apr 05 '22 at 20:27
yes, exactly, this is true! We need to consider only the first appearance of the pattern in each line (in the 3rd column), sorry that I did not clarify it ... And the global idea is to find the 2 patterns that are the ONLY ones belonging to the ID (with no other patterns!) – James Starlight Apr 06 '22 at 07:43
Thank you for the response. I've fixed my code accordingly as well as the one-liner version. Would you please test it again? BR. – tshiono Apr 06 '22 at 08:20
It works! BTW (last question!) I had another difficulty with integrating the AWK part into my bash script, where I would like to define both search patterns as external variables. https://stackoverflow.com/questions/71754834/how-can-i-use-bash-variables-in-my-awk-script – James Starlight Apr 06 '22 at 08:56
Okay, I've understood your problem. I need to modify the regex and the match function. Would you please allow me a while to come back? Cheers. – tshiono Apr 06 '22 at 09:37
yes sure this is not urgent feel free to post your update when you are free. My version with your last AWK script (which does not work) may be found in the link that I sent! – James Starlight Apr 06 '22 at 10:01
I have updated my answer with the one-liner version which accepts bash variables as search patterns. Please take a look at the **[EDIT]** section. Unfortunately your other question seems to have been closed. Sigh... – tshiono Apr 06 '22 at 12:14
Thank you very much for this very useful help! I gotcha! This was an issue related to the quoting of the expression in the AWK part, not the bash variables provided at the beginning of the code .. many thanks again! – James Starlight Apr 06 '22 at 12:55
My pleasure. If you encounter any issue, please feel free to ask again. Cheers. – tshiono Apr 06 '22 at 13:24
Thank you! Indeed, I would be happy to create a new topic and share it with you! Have a nice day and take care – James Starlight Apr 06 '22 at 13:41
