Extract compound data from SDF file using IDNUMBER and write to a new file

Question

I'm still quite new to awk and have been trying to use a bash script and awk to filter a file according to a list of codes in a separate text file. While there are a few similar questions around, I have been unable to adapt their implementations.

My first file, idnumber.txt, looks like this:

4323-7584
K8933-4943
L2837-0493

The file I am attempting to filter the molecule blocks from has entries as follows:

  -ISIS-  -- StrEd -- 

 28 29  0  0  0  0  0  0  0  0999 V2000
   -1.7382    0.7650    0.0000 C   0  0  0  
18 27  1  0  0  0  0
M  END
>  <IDNUMBER> (K784-9550)
K784-9550

$$$$
  -ISIS-  -- StrEd -- 

 28 29  0  0  0  0  0  0  0  0999 V2000
   -1.7382    0.7650    0.0000 C   0  0  0  
18 27  1  0  0  0  0
M  END
>  <IDNUMBER> (4323-7584)
4323-7584

$$$$
  -ISIS-  -- StrEd -- 

 28 29  0  0  0  0  0  0  0  0999 V2000
   -1.7382    0.7650    0.0000 C   0  0  0  
18 27  1  0  0  0  0
M  END
>  <IDNUMBER> (4323-7584)
L2789-0943

$$$$
  -ISIS-  -- StrEd -- 

 28 29  0  0  0  0  0  0  0  0999 V2000
   -1.7382    0.7650    0.0000 C   0  0  0  
18 27  1  0  0  0  0
M  END
>  <IDNUMBER> (4323-2738)
4323-2738

> <SALT>
NaCl

$$$$

The file repeats in this fashion, with each block starting with -ISIS- -- StrEd -- and ending with $$$$. I need to extract the entire block for each ID listed in idnumber.txt. So the expected output is every block above, from -ISIS- -- StrEd -- through $$$$, whose IDNUMBER matches an entry in idnumber.txt. The blocks vary in length.

I have tried a few sed variants that match from the first line of a block to the IDNUMBER and extract around it, but none of them worked. My current iteration of the code is as follows:

#!/bin/bash
cat idnumbers.txt | while read line
do
  sed -n '/^-ISIS-$/,/^$line$/p' compound_library.sdf > filtered.sdf
done

My idea was to match a block that starts with the ISIS phrase and ends with the relevant ID number, and copy it to a file. I realise now that this logic would skip the $$$$ that terminates each block. But I have a feeling I am missing something, as nothing at all is written to filtered.sdf.

Expected output:

  -ISIS-  -- StrEd -- 

 28 29  0  0  0  0  0  0  0  0999 V2000
   -1.7382    0.7650    0.0000 C   0  0  0  
18 27  1  0  0  0  0
M  END
>  <IDNUMBER> (4323-7584)
4323-7584

$$$$
  -ISIS-  -- StrEd -- 

 28 29  0  0  0  0  0  0  0  0999 V2000
   -1.7382    0.7650    0.0000 C   0  0  0  
18 27  1  0  0  0  0
M  END
>  <IDNUMBER> (4323-7584)
L2789-0943

$$$$

Edit: I have tried a different approach based on another question, but I can't figure out how to change the key assigned to each record in awk so that it is taken from the line containing the IDNUMBER, since that line is a different field in each record.

awk 'BEGIN{RS="\\$\\$\\$\\$"; ORS="$$$$"}
     (NR==FNR){a[$1]=$0; next}
     ($1 in a) { print a[$1] }' file1.sdf RS="\n" file2.txt

I assume it would be a matter of changing the field reference in the array $1 to an expression that recognizes the line after > <IDNUMBER>(xyz), but I am unsure how to go about achieving that.

Thanks for sharing your efforts, could you please do mention sample output in your question to make it more clear, thank you. — RavinderSingh13, Sep 16 '22 at 07:28
Also if you could explain logic of getting expected output in your question that will make your question more clear, cheers. — RavinderSingh13, Sep 16 '22 at 07:31
If you want help extracting individual blocks from a file containing multiple blocks, then show a file containing multiple blocks as your sample input (as what separates the blocks is just as important as what they contain) and add the expected output. So, [edit] your question to show 4 or 5 blocks, some of which you want printed and some you don't, but obviously don't show 50 lines or whatever that is per block, reduce it to, say, 5. — Ed Morton, Sep 16 '22 at 10:29
Thanks for the input, I've tried to clarify as per the questions you've asked. — protein_fashion, Sep 16 '22 at 12:51

gniemetz · Answer 1 · 2022-09-19T08:12:07.580

Maybe this is what you are looking for. Some explanation:

[[:blank:]] matches a space or a tab only, not a newline character. Followed by *, it matches any run of them, including none.

The first regex looks for the start pattern -ISIS- -- StrEd -- you mentioned (with a variable number of spaces/tabs in between). On a match, the variable found is set to 1.

The second regex looks for the pattern > <IDNUMBER> (xxxx-xxxx) (also with a variable number of spaces/tabs), where xxxx-xxxx comes from the file idnumber.txt. On a match, found is incremented to 2, so now we know we are inside the block for that ID, which is the text we want to print.

The third regex looks for $$$$. This is the "real" endpoint of the block: found is incremented to 3 and exit jumps to the END section.

As long as the value of found is less than or equal to 2, each input line of compound_library.sdf is appended to the variable text.

In the END block, found is checked for the value 3; if it matches, the whole variable text is printed.

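while IFS= read -r IdNumber; do
  awk '
    BEGIN {
      found=0
      }
    # Start of a block: reset the state and the collected text
    /^[[:blank:]]*-ISIS-[[:blank:]]*--[[:blank:]]*StrEd[[:blank:]]*--/ {
      found=1
      text=""
      }
    # IDNUMBER header of the wanted block
    /^>[[:blank:]]*<IDNUMBER>[[:blank:]]*\('"${IdNumber}"'\)/ {
      found++
      #print "IdNumber='"${IdNumber}"', found=" found >>"/dev/stderr"
      }
    # Collect every line of the current block
    found <= 2 {
      text=sprintf("%s%s\n", text, $0)
      }
    # End of the wanted block: stop at the first match and jump to END
    found == 2 && /^\$\$\$\$$/ {
      found++
      exit
      }
    END {
      if (found == 3) {
        printf "%s", text
        }
      }' \
  compound_library.sdf
  #compound_library.sdf > ${IdNumber}.sdf
done < idnumber.txt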
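
As a side note on the [[:blank:]]* parts of the patterns above: the * makes the class match any run of spaces or tabs, including none at all. A minimal sketch to check that, assuming an awk with POSIX character-class support (the file name blank_demo.txt and the variable blank_matches are made up for the illustration):

```shell
# Four header lines: two spaces, tabs, no separator at all, and a different ID.
printf '>  <IDNUMBER> (4323-7584)\n' > blank_demo.txt
printf '>\t<IDNUMBER>\t(4323-7584)\n' >> blank_demo.txt
printf '><IDNUMBER>(4323-7584)\n' >> blank_demo.txt
printf '>  <IDNUMBER> (K784-9550)\n' >> blank_demo.txt

# Count the lines that match the header pattern for one ID.
blank_matches=$(awk '
  /^>[[:blank:]]*<IDNUMBER>[[:blank:]]*\(4323-7584\)/ { n++ }
  END { print n + 0 }
' blank_demo.txt)

echo "matching header lines: $blank_matches"
```

The first three lines differ only in their separators and all match; the fourth carries a different ID and does not.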

edited Sep 19 '22 at 08:12

answered Sep 16 '22 at 07:59

gniemetz


Calling awk in a shell loop is very rarely the right approach, it's not necessary for a problem like this where a single awk script could do the job orders of magnitude faster. See [why-is-using-a-shell-loop-to-process-text-considered-bad-practice](https://unix.stackexchange.com/q/169716/133219) and [how-do-i-use-shell-variables-in-an-awk-script](https://stackoverflow.com/q/19075671/1745001) and [correct-bash-and-shell-script-variable-capitalization](https://stackoverflow.com/q/673055/1745001) for more on ways to improve your script. – Ed Morton Sep 16 '22 at 10:32
Is it unwise to use awk in a shell loop due to efficiency? – protein_fashion Sep 16 '22 at 15:01
Thanks gniemetz for the attempt, just have a few questions. So with the ```[[:space:]]```, does it match only a single space or all the whitespace between them? I've also tried having a look for the ```found ==``` that you called several times, what does that actually call/do? The end of the block is ```$$$$```, how would that be implemented in the awk command? – protein_fashion Sep 16 '22 at 15:04
@protein_fashion efficiency, robustness, clarity, portability, etc. as stated in that first link I referenced. – Ed Morton Sep 17 '22 at 22:22
btw to notify someone you left a comment for them add their name preceded by @ otherwise they won't necessarily know. – Ed Morton Sep 17 '22 at 22:33
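
For illustration, the single-awk approach suggested in the comments above could look roughly like the sketch below. It is not taken from any of the posts: the function name filter_sdf and the variable names are made up, it assumes the ID to compare is the one in parentheses on the > <IDNUMBER> line (which is what the expected output in the question implies), and it assumes the ID list is not empty.

```shell
# Hypothetical helper: $1 = file with one ID per line, $2 = SDF library.
filter_sdf() {
  awk '
    # First file: remember every wanted ID as an array key.
    # (Assumes the ID file is not empty, otherwise NR == FNR misfires.)
    NR == FNR { if ($1 != "") wanted[$1] = 1; next }

    # Second file: buffer the current record line by line.
    { block = block $0 "\n" }

    # Header line such as ">  <IDNUMBER> (4323-7584)": keep only the text
    # between the parentheses and look it up in the list.
    /^>[ \t]*<IDNUMBER>/ {
      id = $0
      sub(/^[^(]*\(/, "", id)   # drop everything up to and including "("
      sub(/\).*$/, "", id)      # drop ")" and anything after it
      if (id in wanted) keep = 1
    }

    # "$$$$" closes the record: print it if it was flagged, then reset.
    /^\$\$\$\$/ {
      if (keep) printf "%s", block
      block = ""
      keep = 0
    }
  ' "$1" "$2"
}
```

Called as filter_sdf idnumber.txt compound_library.sdf > filtered.sdf, it reads the library once, however many IDs are in the list, and prints every matching record rather than only the first one.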

Daweo · Answer 2 · 2022-09-16T08:44:01.077

"I am missing something"

In this command

sed -n '/^-ISIS-$/,/^$line$/p' compound_library.sdf > filtered.sdf

you are using the following regular expressions

^-ISIS-$
^$line$

^ denotes start of line, $ denotes end of line

The 1st is looking for -ISIS- spanning the whole line, whilst your file has

  -ISIS-  -- StrEd --

that is, -ISIS- as part of a line. Therefore you should use the regular expression without anchors, that is -ISIS-

The 2nd does not do what you expect: inside single quotes the shell does not expand $line, so sed receives the literal text $line rather than an ID, and no line in your file matches it. A range whose end pattern never matches keeps printing until the end of the file. I have no idea if this is the desired behavior, but be warned that the more common way to do that in GNU sed is to use $ as an address (meaning the last line). For example, if you want to print the first line holding a digit and all following lines, you could do

sed -n '/[0-9]/,$p' file.txt
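
One detail of the question's command can be checked in isolation: inside single quotes the shell leaves $line untouched, while inside double quotes it substitutes the value before sed sees the script. A small sketch, where the file name quoting_demo.txt and the variable names are made up for the illustration:

```shell
line='4323-7584'
printf '%s\n' 'K784-9550' '4323-7584' 'L2789-0943' > quoting_demo.txt

# Single quotes: sed receives the literal text $line, so no line matches.
single_quoted=$(sed -n '/^$line$/p' quoting_demo.txt)

# Double quotes: the shell substitutes the value first; \$ keeps the final
# dollar sign for sed as the end-of-line anchor.
double_quoted=$(sed -n "/^${line}\$/p" quoting_demo.txt)

printf 'single: [%s]\ndouble: [%s]\n' "$single_quoted" "$double_quoted"
```

This only demonstrates the quoting. IDs containing characters that are special in regular expressions would additionally need escaping before being spliced into the pattern.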

Ah ok thanks for clarifying that for me. I had been using $line to call from the while loop of line. The idea was that I would loop through the file.txt of all the ID numbers and at each loop would run the sed command but I am thinking now that is not the correct logic. — protein_fashion, Sep 16 '22 at 14:47
