I am trying to write a regex to read lines such as:

*     DCH  :   0.80000000                             *
*      PYR  : 100.00000000                            *
*    Bond (  1,   0)  :   0.80000000                  *
*     Angle (  1,   0,   2)  : 100.00000000           *

To that end, I wrote the regex below. It works, but I would like some feedback on how to capture the integers in parentheses. The part with the integers between parentheses (a kind of tuple of integers) appears on the third and fourth lines above but not on the first two, so it has to be optional.

I had to define several groups to make that tuple of integers optional and to handle the fact that it may contain 2, 3 or 4 integers.

In [64]: coord_patt = re.compile(r"\s+(\w+)\s+(\(((\s*\d+),?){2,4}\))?\s+:\s+(\d+.\d+)")

In [65]: line2 = "*     Angle (  1,   0,   2)  : 100.00000000           *"

In [66]: m = coord_patt.search(line2)

In [67]: m.groups()
Out[67]: ('Angle', '(  1,   0,   2)', '   2', '   2', '100.00000000')

Another example :

In [68]: line = "         *                 Bond (  1,   0)  :   0.80000000           *"

In [69]: m = coord_patt.search(line)
    
In [71]: m.groups()
Out[71]: ('Bond', '(  1,   0)', '   0', '   0', '0.80000000')

As you can see, it works, but I do not understand why the groups contain only the last integer rather than each integer separately. Is there a way to get those integers individually, or to avoid defining all those groups and capture only group 2, the string form of the tuple, which can easily be parsed afterwards?

Ger
  • 1
    related? [Parse a tuple from a string](https://stackoverflow.com/q/9763116/10197418) - solutions there however might not fit in your one big regex pattern (not sure if that is a must?) – FObersteiner May 01 '21 at 12:38

3 Answers

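As indicated in "Capturing repeating subpatterns in Python regex", the standard `re` module doesn't support repeated captures: a capturing group inside a quantifier keeps only its last match, which is why you see only the last integer. The third-party `regex` module does support them.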
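A minimal sketch of that behaviour, using only the standard `re` module. The sample string and the variable names are illustrative and not taken from the question; the second half shows one possible workaround (capture the whole parenthesised text, then extract the integers in a second pass).

```python
import re

# A capturing group inside a repetition is overwritten on every iteration,
# so only the text matched by the last repetition is kept.
m = re.search(r"\((?:\s*(\d+),?){2,4}\)", "(  1,   0,   2)")
last_only = m.group(1)

# Workaround with plain `re`: capture the whole parenthesised text once,
# then pull the integers out of it with a second, simpler pattern.
m2 = re.search(r"\(([^)]*)\)", "(  1,   0,   2)")
all_ints = [int(k) for k in re.findall(r"\d+", m2.group(1))]

print(last_only)  # 2
print(all_ints)   # [1, 0, 2]
```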

Here are two solutions, one based on regex, the other on re and the safe evaluation of the tuple when one is encountered.

Setup

txt = r"""*     DCH  :   0.80000000                             *
*      PYR  : 100.00000000                            *
*    Bond (  1,   0)  :   0.80000000                  *
*     Angle (  1,   0,   2)  : 100.00000000           *
"""

Using regex

import regex

# Note the escaped dot: an unescaped `.` would match any character.
p = regex.compile(r'\s+(\w+)\s+(?:\((?:\s*(\d+),?){2,4}\))?\s+:\s+(\d+\.\d+)')

for s in txt.splitlines():
    if m := p.search(s):
        name = m.group(1)
        tup = tuple(int(k) for k in m.captures(2) if k.isnumeric())
        val = float(m.group(3))
        print(f'{name!r}\t{tup!r}\t{val!r}')

Prints:

'DCH'   ()  0.8
'PYR'   ()  100.0
'Bond'  (1, 0)  0.8
'Angle' (1, 0, 2)   100.0

Using re

import re
import ast

# Note the escaped dot: an unescaped `.` would match any character.
p = re.compile(r'\s+(\w+)\s+(\((?:\s*\d+,?){2,4}\))?\s+:\s+(\d+\.\d+)')

for s in txt.splitlines():
    if m := p.search(s):
        name, tup, val = m.groups()
        tup = ast.literal_eval(tup) if tup is not None else ()
        val = float(val)
        print(f'{name!r}\t{tup!r}\t{val!r}')

Prints:

'DCH'   ()  0.8
'PYR'   ()  100.0
'Bond'  (1, 0)  0.8
'Angle' (1, 0, 2)   100.0
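If you would rather not involve `ast` at all, a possible variant is to run `re.findall` over the captured tuple string. This is only a sketch: `COORD` and `parse_line` are hypothetical names introduced here, and returning `None` for non-matching lines is an assumption about what you want.

```python
import re

# Pattern from the `re` solution above; the decimal point is escaped so that
# `.` matches a literal dot only.
COORD = re.compile(r"\s+(\w+)\s+(\((?:\s*\d+,?){2,4}\))?\s+:\s+(\d+\.\d+)")

def parse_line(line):
    """Return (name, indices, value), or None if the line does not match."""
    m = COORD.search(line)
    if m is None:
        return None
    name, tup, val = m.groups()
    # Group 2 is None when the optional "(i, j, ...)" part is absent.
    indices = tuple(int(k) for k in re.findall(r"\d+", tup or ""))
    return name, indices, float(val)

print(parse_line("*     Angle (  1,   0,   2)  : 100.00000000           *"))
print(parse_line("*     DCH  :   0.80000000                             *"))
```

Wrapping the logic in a function keeps the "no match" case explicit instead of relying on every input line having the expected shape.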
Pierre D
  • 1
    Thank you also for the `:=` operator. I had missed that new feature of 3.8 ! – Ger May 03 '21 at 11:15
  • `regex` seems to be 2 times faster than `re`. Is that generally the case? Should `regex` be preferred over `re`? – Ger May 03 '21 at 21:41

This is probably far more elaborate than what you are looking for, but I thought I would present it anyway as something to add to your "toolbox". It is a top-down parser, so it can handle more complicated situations, including languages that cannot be defined by regular expressions.

from typing import NamedTuple
import re

lines = """
DCH  :   0.80000000
PYR  : 100.00000000
Bond (  1,   0)  :   0.80000000
Angle (  1,   0,   2)  : 100.00000000
"""

class Token(NamedTuple):
    type: str
    value: object  # str, int or float, depending on the token type

# The order of these matters because they are tried in turn:
token_specification = [
    ('WORD', r'[a-zA-Z]\w*'), # cannot use `\w+` since that would also match numbers
    ('COLON', r':'),
    ('LPAREN', r'\('),
    ('RPAREN', r'\)'),
    ('FLOAT', r'\d+\.\d+'),
    ('INT', r'\d+'),
    ('COMMA', r','),
    ('SKIP', r'\s+'),
    ('ERROR', r'.') # anything else
]
tok_regex = re.compile('|'.join('(?P<%s>%s)' % pair for pair in token_specification))

def generate_tokens(code):
    scanner = tok_regex.scanner(code)
    for m in iter(scanner.match, None):
        type = m.lastgroup
        if type == 'SKIP':
            continue
        if type == 'FLOAT':
            value = float(m.group()) # or just m.group()
        elif type == 'INT':
            value = int(m.group()) # or just m.group()
        else:
            value = m.group()
        yield Token(type, value)
    yield Token('EOF', 'EOF') # end of string


class Evaluator():

    def parse(self, s):
        self.token_iterator = generate_tokens(s)
        self.next_token()
        try:
            while self.token.type != 'EOF': # not end of string
                yield self.evaluate()
        except Exception:
            pass

    def evaluate(self):
        # current token should be WORD
        word_value = self.token.value
        self.accept('WORD') # throw exception if not 'WORD'
        i_list = self.optional_int_list()
        # current token should be a colon
        self.accept('COLON')
        # current token should be a float
        float_value = self.token.value
        self.accept('FLOAT')
        return word_value, i_list, float_value

    def optional_int_list(self):
        i_list = []
        if self.token.type == 'LPAREN':
            self.next_token()
            # current token should be an integer
            i_list.append(self.token.value)
            self.accept('INT')
            while self.token.type == 'COMMA':
                self.next_token()
                # next token should be an integer
                i_list.append(self.token.value)
                self.accept('INT')
            # next token should be a right parentheses:
            self.accept('RPAREN')
        return i_list

    def next_token(self):
        self.token = next(self.token_iterator, None)

    def accept(self, type):
        if self.token.type != type:
            raise Exception(f'Error: was expecting a {type}, got {self.token.type}')
        self.next_token()

evaluator = Evaluator()
for word_value, integer_values, float_value in evaluator.parse(lines):
    print(word_value, integer_values, float_value)

Prints:

DCH [] 0.8
PYR [] 100.0
Bond [1, 0] 0.8
Angle [1, 0, 2] 100.0
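To see what the tokenizing stage hands to the parser, here is a reduced sketch for a single line. It is not the code above: `TOKEN_RE` and `tokenize` are hypothetical names, the punctuation tokens are merged into one `PUNCT` type for brevity, and `finditer` is used, which silently skips unmatched characters instead of reporting an `ERROR` token.

```python
import re

# Reduced token table; order matters because alternatives are tried in turn
# (FLOAT must come before INT so that "0.8" is not split at the dot).
TOKEN_RE = re.compile(
    r"(?P<WORD>[a-zA-Z]\w*)|(?P<FLOAT>\d+\.\d+)|(?P<INT>\d+)"
    r"|(?P<PUNCT>[(),:])|(?P<SKIP>\s+)"
)

def tokenize(text):
    """Return (type, text) pairs, dropping whitespace tokens."""
    return [(m.lastgroup, m.group())
            for m in TOKEN_RE.finditer(text)
            if m.lastgroup != "SKIP"]

tokens = tokenize("Bond (  1,   0)  :   0.80000000")
for tok in tokens:
    print(tok)
```

The parser then only has to check that the token types arrive in the expected order (WORD, optional parenthesised INT list, colon, FLOAT).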
Booboo

The integers inside the parentheses are captured as a string, so to get them as ints you can convert that string to a tuple. The code below stores all the tuples in a list that you can retrieve later for further processing.

import re
from ast import literal_eval as make_tuple

lines = [
    "*    DCH  :   0.80000000                       *",
    "*    PYR  : 100.00000000                       *",
    "    Bond (  1,   0)  :   0.80000000                  *",
    "*     Angle (  1,   0,   2)  : 100.00000000           *",
]

coord_patt = re.compile(r"\s+(\w+)\s+(\(((\s*\d+),?){2,4}\))?\s+:\s+(\d+.\d+)")
tuples = list()
for line in lines:
    temp = coord_patt.search(line)

    if temp.groups()[1] is not None:
        tuples.append(make_tuple(temp.groups()[1]))

print(tuples)

for tup in tuples:
    for element in tup:
        print(element, end=' ')

    print()

Here's the output:
[(1, 0), (1, 0, 2)]
1 0
1 0 2
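As a side note on why `literal_eval` is a reasonable choice here rather than `eval`: it only accepts Python literals, so the captured string is parsed as data and anything else is rejected. A small sketch, with illustrative variable names:

```python
from ast import literal_eval

# The captured group is a plain string that happens to look like a tuple literal.
parsed = literal_eval("(  1,   0,   2)")
print(parsed)  # (1, 0, 2)

# Anything that is not a literal (for example a function call) is refused.
try:
    literal_eval("__import__('os')")
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```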
Linux Geek
  • seems a repeat of the second part of my answer, no? – Pierre D May 01 '21 at 13:23
  • @PierreD I am not sure why you felt that way, but I was already writing my answer when you posted the answer. – Linux Geek May 01 '21 at 13:27
  • 1
    no worries, it looked quite similar. But then again, the link given by @MrFuppes comment (which I hadn't seen until now) is also pointing to the same approach... LOL, I guess there are only so many ways to skin a cat... – Pierre D May 01 '21 at 13:34