1

I have a text file. I need to find a part of the file that starts with some arbitrary pattern, and then capture everything between the pattern and it's closing paren. This pattern may appear multiple times in the file. "Start (" will always appear right before the pattern. Example:

start
(
    pattern
    (
        stuff,
        stuff,
            randomThing
            (
                random stuff
            )
    )
)
start
(
    notThePattern
    (
        otherStuff,
        otherStuff
    )
)
start
(
    pattern
    (
        moreStuff,
        moreStuff
    )
)

I would want to get [Start(Pattern(stuff,stuff,randomThing(random stuff))), Start(Pattern(moreStuff,moreStuff)) ].

The way i've done it is with this code:

def myFunct(pattern, input):
    allElements = []
    match = re.search("start\s*?\(\s*?" + pattern, input)
    while (match != None):
        index = match.start()
        element = getElementEndIndex(line[index:])
        allElements.append(element)
        input = input[index+len(element):]
        match = re.search("start\s*?\(\s*?" + pattern, input)

getElementEndIndex just uses a stack to find the closing paren and it's index.

Is this the only way to do this? Can it be solved with just a regex? If not, is there a better way of running the regex that I do have?

Pattern can appear multiple times within a "start" section. Start cannot be within another start section though.

start
(
    pattern
    ()
    blah
    ()
    pattern
    ()
)

is possible, but

start
(
    pattern
    ()
    start
    ()
)

is NOT

user1652427
  • 697
  • 2
  • 8
  • 21
  • 1
    Can these patterns be nested? (e.g. `start(start(pattern(stuff))pattern(morestuff))`) ? If so, what would you expect the behavior to be? If they can't be nested, I would be tempted to do this with `split` and `join`. – Henry Keiter Oct 16 '13 at 15:23
  • Dont know python, but is line in `line[index:])` an array of every character position in the source text? Also, if python does regex recursion, I could show you how to do this. If the nesting is fixed to 2 levels max, regex recursion may not be necessary. –  Oct 16 '13 at 17:11
  • AFAIK you *cannot* do this with a single regex. Regexes cannot count, and determine nested-parentheses patterns requires counting. You need a context-free grammar for that. – Bakuriu Oct 16 '13 at 19:56

5 Answers5

1

So if "stuff" contains parens you can't match that with a regular language as you'd need to start counting left and right parens.

However, if stuff does not. you could do something like pattern newline ( ("not )" or newline) once or more )

so that's going to look something like

/pattern\n(\([^)]\+\|\n\)\+)/

This is similar if not a duplicate of this. So maybe the best solution is to use pyparsing to form a pda of sorts.

If you have control of whatever is writing these things, you might opt for a well known format like json in the future so that you can make use of tools that are already setup to solve this sort of problem. It's more expressive and more portable.

Community
  • 1
  • 1
cs_alumnus
  • 1,579
  • 16
  • 24
  • Right, regular languages are not capable of handling arbitrary depth pattern matching. You'd need to do something like create a stack or count left and right parens. – jeremy Oct 16 '13 at 18:46
0

You state:

I would want to get [Start(Pattern(stuff,stuff)), Start(Pattern(moreStuff,moreStuff)) ].

If so, that is pretty easy:

import re

txt='''\
start
(
    pattern
    (
        stuff,
        stuff
    )
)
start
(
    notThePattern
    (
        otherStuff,
        otherStuff
    )
)
start
(
    pattern
    (
        moreStuff,
        moreStuff
    )
)
'''
rst=[]
for m in re.finditer(r'^(start.*?)(?=start|\Z)', txt, re.S | re.M):
    rst.append(m.group(1).replace('\n','').replace(' ',''))

print rst

prints:

['start(pattern(stuff,stuff))', 
 'start(notThePattern(otherStuff,otherStuff))', 
 'start(pattern(moreStuff,moreStuff))']

Is that what you want? That does not validate that the number of parens is correct.

dawg
  • 98,345
  • 23
  • 131
  • 206
0

Here is a start. You will need massage the list items and prepend the word start to get the exact format you need.

import re
s = """start
(
    pattern
    (
        stuff,
        stuff
    )
    blah
    (
        baz,
        baz
     )
    pattern
    (
        xtrastuff,
        xtrastuff
    )
)
start
(
    notThePattern
    (
        otherStuff,
        otherStuff
    )
)
start
(
    pattern
    (
        moreStuff,
        moreStuff
    )
)"""
# remove all whitespace
s1 = re.sub('\s','',s)
## 'start(pattern(stuff,stuff)blah(bazbaz)pattern(xtrastuff,xtrastuff))start(notThePattern(otherStuff,otherStuff))start(pattern(moreStuff,moreStuff))'

# stuff you are looking for
pattern = 'pattern.*?\)'

# find all of the start 'items' with pattern in them
start_pattern = '(start\(' + pattern + '\))'
starts = re.findall(start_pattern, s1)
## ['start(pattern(stuff,stuff)blah(baz,baz)pattern(xtrastuff,xtrastuff))', 'start(pattern(moreStuff,moreStuff))']

# extract stuff you are looking for from all the 'start' items
for start in starts:
    stuff =  re.findall(pattern, start)
    print stuff
    print '*'*8

>>> 
['pattern(stuff,stuff)', 'pattern(xtrastuff,xtrastuff)']
********
['pattern(moreStuff,moreStuff)']
********
wwii
  • 23,232
  • 7
  • 37
  • 77
0

Is there a better way to find the closing paren of an expression found with a regex?

Yes. Don't use regexes.

Specifically, you want to use a parser of some sort; working with an actual data structure is much easier than ad-hoc regex-matched pieces of text.

Writing a parser is a subject that requires vastly more information than will fit in an SO answer (and I have to admit it's a weak point in my knowledge). The much easier solution is to adapt your data to an already-defined format, and use that format's parser. Common options include JSON, INI, and Unix shell.

If you must write your own parser, you may want to look into something like pyPEG or parsimonious, or any of the other tools on this extensive list.

Xiong Chiamiov
  • 13,076
  • 9
  • 63
  • 101
  • I would say your right if what you need it for is significantly complex, and speed is a factor. Nothing wrong with using regex if the language supports recursion. Dot-Net may have the best solution in its balanced groups, and the ability to capture-as-you-go deeper into nesting. For others like PCRE and perl that can't do that, each inner 'core' has to be re-parsed for dangling data of interrest. This could slow things down, but is simple for prototyping. Simple example `start(?\((?(?:(?>[^()]+)|(?&paren))*)\))` –  Oct 16 '13 at 20:14
0

From the LilyPond convert-ly utility (and written/copyrighted by myself, so I can show it off here):

def paren_matcher (n):
    # poor man's matched paren scanning, gives up
    # after n+1 levels.  Matches any string with balanced
    # parens inside; add the outer parens yourself if needed.
    # Nongreedy.
    return r"[^()]*?(?:\("*n+r"[^()]*?"+r"\)[^()]*?)*?"*n

convert-ly tends to use this as paren_matcher (25) in its regular expressions which is likely overkill for most applications. But then it uses it for matching Scheme expressions.