
I am working with log files arranged in the following format:

fƒdfFinding intramodel H-bonds
Constraints relaxed by 0.5 angstroms and 20 degrees
Models used:
    1.1 SarsCov2_structure49R_nsp5holo_rep1.pdb
    1.2 SarsCov2_structure49R_nsp5holo_rep1.pdb
    1.3 SarsCov2_structure49R_nsp5holo_rep1.pdb
    1.4 SarsCov2_structure49R_nsp5holo_rep1.pdb
    1.5 SarsCov2_structure49R_nsp5holo_rep1.pdb
    1.6 SarsCov2_structure49R_nsp5holo_rep1.pdb
    1.7 SarsCov2_structure49R_nsp5holo_rep1.pdb
    1.8 SarsCov2_structure49R_nsp5holo_rep1.pdb
    1.9 SarsCov2_structure49R_nsp5holo_rep1.pdb
    1.10 SarsCov2_structure49R_nsp5holo_rep1.pdb
    1.11 SarsCov2_structure49R_nsp5holo_rep1.pdb
    1.12 SarsCov2_structure49R_nsp5holo_rep1.pdb
    1.13 SarsCov2_structure49R_nsp5holo_rep1.pdb
    1.14 SarsCov2_structure49R_nsp5holo_rep1.pdb

14 H-bonds
H-bonds (donor, acceptor, hydrogen, D..A dist, D-H..A dist):
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.1/? ASN 142 ND2   SarsCov2_structure49R_nsp5holo_rep1.pdb #1.1/A UNL 888 O   SarsCov2_structure49R_nsp5holo_rep1.pdb #1.1/? ASN 142 1HD2   3.102  2.145
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.3/? GLU 166 N     SarsCov2_structure49R_nsp5holo_rep1.pdb #1.3/A UNL 888 O   SarsCov2_structure49R_nsp5holo_rep1.pdb #1.3/? GLU 166 H      3.011  2.024
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.4/? GLU 166 N     SarsCov2_structure49R_nsp5holo_rep1.pdb #1.4/A UNL 888 O   SarsCov2_structure49R_nsp5holo_rep1.pdb #1.4/? GLU 166 H      3.037  2.132
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.5/? HIS 163 NE2   SarsCov2_structure49R_nsp5holo_rep1.pdb #1.5/A UNL 888 O   no hydrogen                                                   3.388  N/A
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.5/? GLU 166 N     SarsCov2_structure49R_nsp5holo_rep1.pdb #1.5/A UNL 888 O   SarsCov2_structure49R_nsp5holo_rep1.pdb #1.5/? GLU 166 H      2.806  1.792
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.7/? THR 26 N      SarsCov2_structure49R_nsp5holo_rep1.pdb #1.7/A UNL 888 O   SarsCov2_structure49R_nsp5holo_rep1.pdb #1.7/? THR 26 H       3.093  2.142
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.7/? GLY 143 N     SarsCov2_structure49R_nsp5holo_rep1.pdb #1.7/A UNL 888 O   SarsCov2_structure49R_nsp5holo_rep1.pdb #1.7/? GLY 143 H      3.030  2.193
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.9/? GLN 189 NE2   SarsCov2_structure49R_nsp5holo_rep1.pdb #1.9/A UNL 888 O   SarsCov2_structure49R_nsp5holo_rep1.pdb #1.9/? GLN 189 2HE2   3.052  2.301
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.10/? GLU 166 N    SarsCov2_structure49R_nsp5holo_rep1.pdb #1.10/A UNL 888 O  SarsCov2_structure49R_nsp5holo_rep1.pdb #1.10/? GLU 166 H     2.854  1.868
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.12/? GLY 143 N    SarsCov2_structure49R_nsp5holo_rep1.pdb #1.12/A UNL 888 O  SarsCov2_structure49R_nsp5holo_rep1.pdb #1.12/? GLY 143 H     3.103  2.070
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.13/? GLY 143 N    SarsCov2_structure49R_nsp5holo_rep1.pdb #1.13/A UNL 888 O  SarsCov2_structure49R_nsp5holo_rep1.pdb #1.13/? GLY 143 H     3.161  2.224
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.13/? CYS 145 SG   SarsCov2_structure49R_nsp5holo_rep1.pdb #1.13/A UNL 888 O  SarsCov2_structure49R_nsp5holo_rep1.pdb #1.13/? CYS 145 HG    3.421  2.842
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.14/? ASN 142 ND2  SarsCov2_structure49R_nsp5holo_rep1.pdb #1.14/A UNL 888 O  SarsCov2_structure49R_nsp5holo_rep1.pdb #1.14/? ASN 142 2HD2  3.055  2.465
SarsCov2_structure49R_nsp5holo_rep1.pdb #1.14/? CYS 145 N    SarsCov2_structure49R_nsp5holo_rep1.pdb #1.14/A UNL 888 O  SarsCov2_structure49R_nsp5holo_rep1.pdb #1.14/? CYS 145 H     2.924  2.143

I need to find the first occurrence of the "GLU 166 N" pattern and print the number that appears just before it on the same line, in the token of the form #1.number/?. In the example, the detected number should be 3 (since the associated token is #1.3/?).

I would start from basic pattern-detection

awk '/GLU 166 N/' file

but how do I correctly find the number located just before the pattern and print it as output? Finally, if the pattern cannot be found, I would like the script to print 1.

jaco0646
James Starlight
  • It looks like `GLU 166 N` could appear in 2 locations on a line, and there's a number before each location, but that number is always the same in both locations in a given line, e.g. you apparently could have "SarsCov2_structure49R_nsp5holo_rep1.pdb **#1.1/?** ASN 142 ND2 SarsCov2_structure49R_nsp5holo_rep1.pdb #1.1/A UNL 888 O SarsCov2_structure49R_nsp5holo_rep1.pdb **#1.1/? GLU 166 N** 3.102 2.145". Do you need to check for `GLU 166 N` in multiple locations? Is the number that looks like `#1.1/?` always the same across the line? Can you ever have `GLU 166 N17` etc. that should NOT match? – Ed Morton Apr 01 '22 at 13:30
  • I'm asking because, while some of us assumed and coded for the worst, the currently accepted answer assumes that `GLU 166 N17` or similar can never occur, or that if it does you want it to match against `GLU 166 N`. It also assumes that if `GLU 166 N` occurs later in the line than the 3rd field, it's still OK to print the 2nd field in the line rather than the number in the field immediately before `GLU 166 N`. It'd be good to know what your requirements are for those cases. – Ed Morton Apr 01 '22 at 13:35

4 Answers

1
$ awk -vn=1 '/GLU 166 N/ {gsub(/.*\.|\/\?/,"",$2); n=$2; exit} END {print n}' file
3
$ awk -vn=1 '/GLU 166 N/ {gsub(/.*\.|\/\?/,"",$2); n=$2; exit} END {print n}' /dev/null
1

The value you are looking for is in the second field ($2). gsub(/.*\.|\/\?/,"",$2) deletes from $2 everything up to and including the period, as well as the trailing /?, so that only the number remains.
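
To see what that substitution does in isolation, here is a minimal sketch that applies the same gsub() to a single token. The function name strip_model_token is hypothetical and only serves the illustration; it assumes tokens of the form shown in the question's log.

```shell
# Hypothetical helper: apply the answer's gsub() to one token and print what is left.
strip_model_token() {
    # With no third argument, gsub() works on the whole record ($0).
    printf '%s\n' "$1" | awk '{ gsub(/.*\.|\/\?/, ""); print }'
}

strip_model_token '#1.3/?'     # prints 3
strip_model_token '#1.10/?'    # prints 10
```

The first alternative, .*\., removes "#1." and the second, \/\?, removes the trailing "/?".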

Renaud Pacalet
  • Actually, it always returns 1, even if the pattern is present in another part. For clarity, I've just edited the first post to give an example where it should return 3. – James Starlight Apr 01 '22 at 12:42
  • Ah, indeed your first example was a bit ambiguous and I thought you wanted to keep the first number. I changed the regex such that we now keep the second (between the period and `/?`). – Renaud Pacalet Apr 01 '22 at 12:48
  • great! thank you! just one question: is it possible to modify the AWK part a little bit to print 1 if no pattern has been found in the log? For the moment it prints nothing in this case (obviously!) :-) – James Starlight Apr 01 '22 at 13:17
  • Sure, see my update. But please also edit your question to add this specification, else the Q&A will not be consistent anymore and future readers would not understand. – Renaud Pacalet Apr 01 '22 at 13:20
  • thank you very much! well done! sorry for the question: what is the -vn=1 in the updated version ? – James Starlight Apr 01 '22 at 13:28
  • It declares an `awk` variable named `n` and assigns it value `1`. It is (almost) the same as a `BEGIN {n=1}` block. This way if your pattern is not found, the printed value in the `END` block will be `1`, while if it is found it will be whatever number was there. – Renaud Pacalet Apr 01 '22 at 13:30
  • I gotcha, thank you so much! It means that I may change n to any other number, which will be used if the pattern is not found, right? – James Starlight Apr 01 '22 at 13:33
  • Absolutely. Set it to any value you want as a default, even a text string if you wish. – Renaud Pacalet Apr 01 '22 at 13:33
  • OK, I've just done it! – James Starlight Apr 01 '22 at 13:43
  • I've just created a new topic with further development of this story :-) https://stackoverflow.com/questions/71737984/awk-log-processing-based-on-multiple-patterns – James Starlight Apr 04 '22 at 13:21
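
The default-value mechanism discussed in these comments can be sketched as two wrappers around the answer's awk program. The function names glu_with_v and glu_with_begin are hypothetical, and the optional second argument of glu_with_v is an assumption added for illustration. One difference that can matter in practice: a -v assignment is made before the program starts (including BEGIN), so the fallback can be supplied from the shell without editing the awk code.

```shell
# Hypothetical wrappers; $1 is the log file.
glu_with_v() {
    # Fallback comes from the shell ($2, defaulting to 1) via -v.
    awk -v n="${2:-1}" '/GLU 166 N/ {gsub(/.*\.|\/\?/,"",$2); n=$2; exit} END {print n}' "$1"
}

glu_with_begin() {
    # Fallback is hard-coded inside the awk program instead.
    awk 'BEGIN {n=1} /GLU 166 N/ {gsub(/.*\.|\/\?/,"",$2); n=$2; exit} END {print n}' "$1"
}

glu_with_v /dev/null         # prints 1
glu_with_v /dev/null none    # prints none
glu_with_begin /dev/null     # prints 1
```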
1

Using GNU awk for the 3rd arg to match():

$ awk 'match($0,/([0-9]+).. GLU 166 N /,a){print a[1]; exit}' file
3

or using any awk:

$ awk 'match($0,/[0-9]+.. GLU 166 N /){sub("/.*",""); print substr($0,RSTART); exit}' file
3

$ awk 'match($0,/[0-9]+.. GLU 166 N /){print substr($0,RSTART,RLENGTH-13); exit}' file
3
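
As a sketch of how the last variant could also cover the "print 1 when nothing matches" requirement, the function below combines match() with a found flag. The name first_glu_model, the def variable, and the optional second argument are hypothetical; the regex spells out the "/?" that the two dots above match, which is an assumption about the log format.

```shell
# Hypothetical helper: first_glu_model FILE [DEFAULT]
first_glu_model() {
    awk -v def="${2:-1}" '
        match($0, /[0-9]+\/\? GLU 166 N /) {
            # The match is "<digits>/? GLU 166 N "; the fixed tail
            # "/? GLU 166 N " is 13 characters, so drop it from RLENGTH.
            print substr($0, RSTART, RLENGTH - 13)
            found = 1
            exit
        }
        END { if (!found) print def }
    ' "$1"
}
```

Because the regex ends in a space, a token such as GLU 166 N17 would not match, and the number is taken from wherever on the line the pattern actually occurs.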
Ed Morton
0

If GNU awk, which supports the gensub function, is available, would you please try:

awk '/GLU 166 N/ {
    print gensub(/^.*#1\.([0-9]+)\/\? GLU 166 N.*$/, "\\1", 1)
    exit
}'  file

The regex ^.*#1\.([0-9]+)\/\? GLU 166 N.*$ matches a line containing the substring #1.<number>/? GLU 166 N. The <number> portion, enclosed in parentheses in the regex as ([0-9]+), is captured as group 1. The entire line is then replaced with group 1, which is specified by the replacement "\\1", and the result is printed.
Alternatively, with GNU sed:

sed -nE '0,/GLU 166 N/s|^.*#1\.([0-9]+)/\? GLU 166 N.*|\1|p' file

The address 0,/pattern/, where the starting line 0 is specific to GNU sed, restricts the substitution to the lines up to and including the first pattern match, so only the first match is printed.
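
Since gensub and the 0,/pattern/ address are both GNU extensions, here is a hedged sketch of the same idea using only basic regular expressions, with head taking the first hit and a shell default for the not-found case. The function name first_glu_model_sed and its optional second argument are hypothetical, and the trailing space after N is an added assumption so that tokens such as NE2 are not matched.

```shell
# Hypothetical helper: first_glu_model_sed FILE [DEFAULT]
first_glu_model_sed() {
    # Print the captured number for every matching line, then keep the first one.
    n=$(sed -n 's|^.*#1\.\([0-9][0-9]*\)/? GLU 166 N .*|\1|p' "$1" | head -n 1)
    # Fall back to DEFAULT (or 1) when nothing matched.
    printf '%s\n' "${n:-${2:-1}}"
}
```

In a basic regular expression the ? is an ordinary character, so it needs no backslash here.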

tshiono
  • I've just checked: both methods detect nothing even if the pattern is present in the log file. For clarity, I've just modified the log example in the first message to show a case where it should return 3. – James Starlight Apr 01 '22 at 12:39
  • Thank you for the feedback, but both scripts output `3` for the modified example. Would you please try with the copy&pasted input file of your post, instead of your original file at hand? – tshiono Apr 01 '22 at 13:20
  • Perhaps it was due to the awk version (I am using macOS)... I had to accept another answer that worked in my situation. Anyway, thank you very much for your version and your attention to my question! – James Starlight Apr 01 '22 at 13:37
  • Thank you for the courtesy of your reply. I hope to answer your question well at the next opportunity. BR. – tshiono Apr 01 '22 at 13:56
  • probably we do have a change in the development of this awk story :-) https://stackoverflow.com/questions/71737984/awk-log-processing-based-on-multiple-patterns – James Starlight Apr 04 '22 at 13:20
0

If awk is not a requirement, you can use grep and cut. Simple is good.

With the log from the question saved as input.txt:

grep -om1 '[[:digit:]]*/? GLU 166 N' input.txt | cut -d/ -f1
3

To print 1 when the pattern is not found:

{ grep -om1 '[[:digit:]]*/? GLU 166 N' input.txt || echo 1; } | cut -d/ -f1
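
A sketch of the same pipeline wrapped in a function, with one extra guard: -m1 stops after the first matching line, but -o can still print several matches if the pattern occurs more than once on that line, so head -n 1 keeps only the first. The function name first_glu_model_grep and its optional second argument are hypothetical; note also that -o and -m are not required by POSIX, although GNU and BSD grep both provide them.

```shell
# Hypothetical helper: first_glu_model_grep FILE [DEFAULT]
first_glu_model_grep() {
    # grep -o prints only the matched text, e.g. "3/? GLU 166 N".
    # If grep finds nothing it exits non-zero, so echo supplies the default.
    # cut then keeps whatever precedes the first "/".
    { grep -om1 '[[:digit:]]*/? GLU 166 N' "$1" || echo "${2:-1}"; } |
        head -n 1 |
        cut -d/ -f1
}
```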
Weihang Jian