I'm still quite new to awk and have been trying to use a bash script and awk to filter a file according to a list of codes in a separate text file. While there are a few similar questions around, I have been unable to adapt their implementations.
My first file, idnumbers.txt, looks like this:
4323-7584
K8933-4943
L2837-0493
The file I am attempting to filter the molecule blocks from has entries as follows:
-ISIS- -- StrEd --
28 29 0 0 0 0 0 0 0 0999 V2000
-1.7382 0.7650 0.0000 C 0 0 0
18 27 1 0 0 0 0
M END
> <IDNUMBER> (K784-9550)
K784-9550
$$$$
-ISIS- -- StrEd --
28 29 0 0 0 0 0 0 0 0999 V2000
-1.7382 0.7650 0.0000 C 0 0 0
18 27 1 0 0 0 0
M END
> <IDNUMBER> (4323-7584)
4323-7584
$$$$
-ISIS- -- StrEd --
28 29 0 0 0 0 0 0 0 0999 V2000
-1.7382 0.7650 0.0000 C 0 0 0
18 27 1 0 0 0 0
M END
> <IDNUMBER> (4323-7584)
L2789-0943
$$$$
-ISIS- -- StrEd --
28 29 0 0 0 0 0 0 0 0999 V2000
-1.7382 0.7650 0.0000 C 0 0 0
18 27 1 0 0 0 0
M END
> <IDNUMBER> (4323-2738)
4323-2738
> <SALT>
NaCl
$$$$
The file repeats in this fashion, with each block starting with -ISIS- -- StrEd -- and ending with $$$$. I need to extract the entire block for each ID listed in idnumbers.txt, so the expected output is every block, from -ISIS- through $$$$, whose IDNUMBER matches an entry in idnumbers.txt.
Each entry is a different length, so I cannot rely on a fixed number of lines per block.
I have tried a few options with sed, trying to match from the first line of a block through to the IDNUMBER and extract around it, but that didn't work. My current iteration of the code is as follows:
#!/bin/bash
cat idnumbers.txt | while read line
do
sed -n '/^-ISIS-$/,/^$line$/p' compound_library.sdf > filtered.sdf
done
The idea was to find each block that starts with the ISIS line and ends with the relevant ID number, and copy it to a file. I realise now that this logic would skip the $$$$ that terminates each block.
But I have a feeling I am missing something else, as it is not actually writing anything to filtered.sdf.
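As far as I can tell, three things in the loop above each contribute to the empty file. The single quotes stop the shell from expanding $line, so sed searches for the literal text $line. The pattern /^-ISIS-$/ only matches a line that is exactly -ISIS-, whereas the real header line continues with " -- StrEd --", so the range never starts. And > truncates filtered.sdf on every pass through the loop. The sketch below only illustrates the quoting difference; the variable names single and double are made up for the demonstration.

```shell
#!/bin/bash
line='4323-7584'

# Single quotes: the shell leaves $line untouched, so sed would see "$line".
single='/^-ISIS-/,/^$line$/p'

# Double quotes: $line is expanded; \$ keeps the end-of-line anchor literal.
double="/^-ISIS-/,/^${line}\$/p"

printf '%s\n' "$single" "$double"
# prints:
#   /^-ISIS-/,/^$line$/p
#   /^-ISIS-/,/^4323-7584$/p
```

Even with the quoting fixed, a sed range would begin at the first -ISIS- line it meets, not necessarily the one belonging to the matching ID, and it would stop before $$$$. Handling one whole block at a time therefore looks like the easier route.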
Expected output:
-ISIS- -- StrEd --
28 29 0 0 0 0 0 0 0 0999 V2000
-1.7382 0.7650 0.0000 C 0 0 0
18 27 1 0 0 0 0
M END
> <IDNUMBER> (4323-7584)
4323-7584
$$$$
-ISIS- -- StrEd --
28 29 0 0 0 0 0 0 0 0999 V2000
-1.7382 0.7650 0.0000 C 0 0 0
18 27 1 0 0 0 0
M END
> <IDNUMBER> (4323-7584)
L2789-0943
$$$$
Edit: I have tried a different approach based on another question, but I have not been able to work out how to key each record in awk on its ID, because the line containing the IDNUMBER falls at a different position in each record.
awk 'BEGIN{RS="\\$\\$\\$\\$"; ORS="$$$$"}
(NR==FNR){a[$1]=$0; next}
($1 in a) { print a[$1] }' file1.sdf RS="\n" file2.txt
I assume it would be a matter of changing the array key from $1 to an expression that recognizes the line after > <IDNUMBER> (xyz), but I am unsure how to go about achieving that.
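One possible way to do this, offered as a sketch and not a definitive answer, is to leave the record separator alone, buffer each block line by line, and decide whether to print it when $$$$ arrives. The function name filter_sdf is made up for illustration. The sketch assumes the ID to match is the one in parentheses on the > <IDNUMBER> line, because that is what reproduces the expected output above (the third block's value line, L2789-0943, is not in the list). It also assumes Unix line endings and a non-empty ID list.

```shell
#!/bin/bash
# filter_sdf IDS_FILE SDF_FILE   (hypothetical helper name)
# Prints every block of SDF_FILE whose "> <IDNUMBER> (...)" ID is in IDS_FILE.
filter_sdf() {
    awk '
        # First file: store each wanted ID as an array key.
        NR == FNR { if (NF) want[$1]; next }

        # Second file: accumulate the lines of the current block.
        { block = block $0 "\n" }

        # Tag line: pull the ID out of the parentheses.
        /^> *<IDNUMBER>/ {
            id = $0
            sub(/^[^(]*\(/, "", id)   # drop everything up to and including "("
            sub(/\).*$/, "", id)      # drop ")" and anything after it
        }

        # End of block: print it if its ID was requested, then reset.
        $0 == "$$$$" {
            if (id in want) printf "%s", block
            block = ""
            id = ""
        }
    ' "$1" "$2"
}
```

With the file names used earlier, it would be called as filter_sdf idnumbers.txt compound_library.sdf > filtered.sdf. To key on the line after the tag instead, the tag rule could set a flag and the following line could be copied into id.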