I'm trying to parse the following Go import statements with a Python regex.

import (
    "github.com/user/qrt"

    "fmt"

    "github.com/user/zyx"
)

import "abcdef"

import "abzdef"

Ideally a single regex would yield everything inside the parens as a single group, and each item in the single-line import statements as its own group.

Here's what I have for each kind of import statement separately (the pattern is to the right of the colon):

# import (...)  : r'import\s*(\()(.*?)(\))'
# import ".."   : r'import\s*(\")(.*?)(\")'

I'm thinking I could use a conditional pattern like the one below, testing the first group to decide whether I'm parsing a () import or a "" import:

(?(id)yes|no)   : match 'yes' if group 'id' matched, else 'no'
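A minimal sketch of that conditional idea, assuming a parenthesised block never contains a nested `)`. The names `source`, `pattern` and `find_imports` are made up for illustration. Note that it only returns the block contents as one group, so the individual paths still have to be split out afterwards.

```python
import re

source = '''import (
    "github.com/user/qrt"

    "fmt"
)

import "abcdef"
'''

# Group 1 optionally captures "(". The conditional (?(1)yes|no) then takes
# the parenthesised branch if group 1 matched, else the quoted branch.
pattern = re.compile(r'import\s*(\()?(?(1)([^)]*)\)|"([^"]*)")')


def find_imports(text):
    """Return ("block", [paths]) or ("single", path) for each import."""
    found = []
    for m in pattern.finditer(text):
        if m.group(1):
            # Group 2 is everything between the parens; split on whitespace.
            found.append(("block", m.group(2).split()))
        else:
            # Group 3 is the path without its surrounding quotes.
            found.append(("single", m.group(3)))
    return found


results = find_imports(source)
print(results)
```

The block branch keeps the quotes around each path because it splits the raw text, while the single-line branch captures the path without them.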

bdbaddog
1 Answer

Something like this?

Ideone.com

import re

test = """import (
    "github.com/user/qrt"

    "fmt"

    "github.com/user/zyx"
)

import "abcdef"

import "abzdef"
"""

rx = re.compile(r'import\s+([^(]+?$|\([^)]+\))', re.MULTILINE)
rx2 = re.compile(r'".*"', re.MULTILINE)

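for m in rx.finditer(test):
    imp = m.group(1)
    if imp[0] == '(':
        # Parenthesised block: pull out each quoted path with the second regex
        for item in rx2.finditer(imp):
            print(item.group(0))
    else:
        # Single-line import: the captured group is the quoted path itself
        print(imp)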

Output

"github.com/user/qrt"
"fmt"
"github.com/user/zyx"
"abcdef"
"abzdef"

EDIT: Just for fun, I tried a mock-up recursive descent parser. It tolerates broken syntax, but it's an idea and it's easy to use: just iterate over it.

http://ideone.com/iS4aww

import re

test = """import (
    "github.com/user/qrt"

    "fmt"

    "github.com/user/zyx"
)

import "abcdef"

import "abzdef"
"""

BEGIN = 1
IMPORT = 2
DESCENT = 3


class Lexer(object):
    def __init__(self, text):
        self._rx = re.compile(r'(import|".*?"|\(|\))')
        self._text = text

    def __iter__(self):
        for m in self._rx.finditer(self._text):
            yield m.group(1)


class RecursiveDescent(object):
    state = BEGIN

    def __init__(self, lexer):
        self._lexer = lexer

    def __iter__(self):
        for token in self._lexer:
            if self.state == BEGIN:
                if token != 'import':
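                    # Most likely the start of the program proper, so stop.
                    # (A bare return ends the generator; since Python 3.7,
                    # raising StopIteration here becomes a RuntimeError.)
                    return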
                self.state = IMPORT
            elif self.state == IMPORT:
                if token == '(':
                    self.state = DESCENT
                else:
                    self.state = BEGIN
                    yield token
            elif self.state == DESCENT:
                if token == ')':
                    self.state = BEGIN
                else:
                    yield token

for path in RecursiveDescent(Lexer(test)):
    print(path)

Output

"github.com/user/qrt"
"fmt"
"github.com/user/zyx"
"abcdef"
"abzdef"
totoro
  • I like this, but was hoping for a single regex, if it's possible. – bdbaddog May 08 '16 at 17:07
  • It's not possible with a regex. You will need a recursive descent parser (http://en.wikipedia.org/wiki/Recursive_descent_parser), which is kind of what this is. See also http://stackoverflow.com/questions/5060659/python-regexes-how-to-access-multiple-matches-of-a-group. – totoro May 08 '16 at 18:01
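
The limitation behind that last comment can be sketched as follows. Python's `re` keeps only the last capture of a group that sits inside a repetition, so a single pattern cannot hand back every path in a block as its own group. The `block` string is a made-up, flattened stand-in for the contents of an `import (...)` statement.

```python
import re

# Hypothetical flattened import block, used only for illustration.
block = '("github.com/user/qrt" "fmt" "github.com/user/zyx")'

# The capture group is repeated three times, but the match object
# only remembers the final repetition.
repeated = re.fullmatch(r'\((?:\s*("[^"]*"))+\s*\)', block)
last_only = repeated.group(1)

# A second pass over the block recovers every quoted item.
all_items = re.findall(r'"[^"]*"', block)

print(last_only)
print(all_items)
```

This is why both approaches in the answer use two steps: one to find the import statement, and one to walk the items inside it.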