I'm struggling to understand the syntax of repeated regex patterns in Python. This is my code:

import re

string='''
-GOLD- 10181914 93D 1 1.00000 0.00000

58 61 0 0 0 0 0 0 0 0 1 V2000
3.4354 -3.4974 -16.5634 N 0 0 0 0 0 0 0 0 0 0 0 0
4.5427 -4.0070 -16.0569 C 0 0 0 0 0 0 0 0 0 0 0 0
5.5389 -3.2151 -15.7189 N 0 0 0 0 0 0 0 0 0 0 0 0
6.3839 -3.5953 -15.3094 H 0 0 0 0 0 0 0 0 0 0 0 0
'''

line_pat = '([+-]*\d+.\d+\s+){3}\w+'

print(re.findall(line_pat,string))

What I'm trying to capture is every line that contains three floats followed by a capital letter, e.g. "3.4354 -3.4974 -16.5634 N". Why doesn't this work?

J.Doe
  • PyPi regex library can handle the pattern and keeps a capture stack for each group. – Wiktor Stribiżew Nov 20 '19 at 15:03
  • Even though the question is closed, your regex is slightly off. Try: line_pat = '([+-]\d+\.\d+\s+)\w+'. Let re.findall handle multiple occurrences for you, and make sure to escape the "."; you also didn't need the * after the first block. – Ulises André Fierro Nov 20 '19 at 15:04
  • OK, I should've looked harder through the suggested questions, I guess. Anyway, a good way to capture lines of 3D Cartesian coordinates from file types such as .xyz, .mol2 and .sdf appears to be the following regex: ```((?:[+-]*\d+\.\d+\s+){3}[a-zA-Z]+)``` – J.Doe Nov 20 '19 at 15:18
  • You do not need a regex in the first place. Split the lines, split each one with space, if number of items is 4 or more, check if Field 4 is an uppercase letter, if yes, grab the first 3 into a tuple/list and add to the resulting list. – Wiktor Stribiżew Nov 20 '19 at 15:37
  • This is just a small extract from an .sdf file. There's a lot of other information in there. It's actually easier and safer to just load the file as a string and search for the pattern. I'm writing a basic sdf to xyz file converter. In that file type, your method would work no problem. – J.Doe Nov 20 '19 at 15:43
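
For reference, the two approaches discussed in the comments can be sketched in Python roughly as follows. This is a minimal sketch, not a definitive converter. When a capturing group is repeated, re.findall returns only the last repetition of that group, which is why the original pattern yields a single float per match; wrapping a non-capturing (?:...) group in one outer capturing group returns the whole fragment instead. The names FIXED_PAT, coords_by_regex and coords_by_split are made up for illustration, and using [ \t]+ instead of \s+ (so that a match cannot run across a line break) is an assumption that goes beyond the patterns posted above.

```python
import re

# Sample taken from the question (a small extract of an .sdf file).
SAMPLE = '''
-GOLD- 10181914 93D 1 1.00000 0.00000

58 61 0 0 0 0 0 0 0 0 1 V2000
3.4354 -3.4974 -16.5634 N 0 0 0 0 0 0 0 0 0 0 0 0
4.5427 -4.0070 -16.0569 C 0 0 0 0 0 0 0 0 0 0 0 0
5.5389 -3.2151 -15.7189 N 0 0 0 0 0 0 0 0 0 0 0 0
6.3839 -3.5953 -15.3094 H 0 0 0 0 0 0 0 0 0 0 0 0
'''

ONE_LINE = "3.4354 -3.4974 -16.5634 N"

# Pattern from the question: the repeated group is a capturing group,
# so findall reports only the group's last repetition for each match.
ORIGINAL_PAT = r'([+-]*\d+.\d+\s+){3}\w+'

# Assumed fix: non-capturing inner group, one outer capturing group,
# an escaped dot, and [ \t]+ so that a match stays on a single line.
FIXED_PAT = r'((?:[+-]?\d+\.\d+[ \t]+){3}[A-Za-z]+)'


def coords_by_regex(text):
    """Return fragments such as '3.4354 -3.4974 -16.5634 N'."""
    return re.findall(FIXED_PAT, text)


def coords_by_split(text):
    """Regex-free variant: split each line and inspect the fourth field."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Keep lines whose fourth field is alphabetic and starts uppercase.
        if len(fields) >= 4 and fields[3].isalpha() and fields[3][0].isupper():
            try:
                x, y, z = (float(f) for f in fields[:3])
            except ValueError:
                continue  # the first three fields are not all numbers
            rows.append((x, y, z, fields[3]))
    return rows


print(re.findall(ORIGINAL_PAT, ONE_LINE))  # only the last repetition
print(coords_by_regex(SAMPLE))
print(coords_by_split(SAMPLE))
```

The regex variant returns strings, which is convenient when the fragments are written straight to an .xyz file; the split variant returns numeric tuples and may be easier to validate, but it relies on the fourth whitespace-separated field being the element symbol.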
