I am trying to match any text between two specific words, START and END.

    START
    aba
    asds
    asdsa 
    END

    NOTREQUIRED

    START
    fdfdfsds
    ssdsds
    sdsds
    END

    START
    aba
    asds
    asdsa 
    END

    NOTREQUIRED

    START
    fdfdfsds
    ssdsds
    sdsds
    END

I have written a regex like this:

    START[\s\S]*END

The problem is that it matches from the first occurrence of START to the last occurrence of END in the document.

I then modified the rule to:

    START(.*?)END

Now it only matches the first set.

I want to match the first occurrence of START with the first occurrence of END, the second occurrence of START with the second occurrence of END, and so on. How do I write my regex? I tried several rules mentioned in this Stack Overflow thread, but none of them did what I need.

Please advise.

shakthydoss

2 Answers

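Your regex works fine; you just have to let `.` match newlines (re.DOTALL) and apply the pattern repeatedly. This can be done using re.finditer():

    import re

    # re.DOTALL lets "." match newlines, so one match can span several lines
    preg = re.compile(r'START(.*?)END', re.DOTALL)

    # finditer() yields one match object per START...END pair, in order
    for match in preg.finditer(text):
        print(match.group(1).strip() + '\n')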
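
Because finditer() returns match objects rather than plain strings, it can also report where each block starts. The following is only a minimal sketch: the shortened `SAMPLE` text and the helper name `extract_blocks` are made up for illustration and are not part of the question.

```python
import re

# Hypothetical sample input, shortened from the text in the question.
SAMPLE = """START
aba
asds
END

NOTREQUIRED

START
fdfdfsds
END"""

# Non-greedy capture between the markers; DOTALL lets "." cross newlines.
BLOCK_RE = re.compile(r"START(.*?)END", re.DOTALL)


def extract_blocks(text):
    """Return (start_offset, body) for every START...END pair, in order."""
    return [(m.start(), m.group(1).strip()) for m in BLOCK_RE.finditer(text)]


blocks = extract_blocks(SAMPLE)
for offset, body in blocks:
    print(offset, repr(body))
```

If only the captured text is needed, re.findall() with the same pattern is shorter; finditer() is worth it when the offsets or other match details matter.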
Finwood

Simply use re.findall with the re.S flag. re.S (an alias for re.DOTALL) makes the . character match every character, including newlines.

Demo:

>>> import re
>>> text = """START
...     aba
...     asds
...     asdsa 
...     END
... 
...     NOTREQUIRED
... 
...     START
...     fdfdfsds
...     ssdsds
...     sdsds
...     END
... 
...     START
...     aba
...     asds
...     asdsa 
...     END
... 
...     NOTREQUIRED
... 
...     START
...     fdfdfsds
...     ssdsds
...     sdsds
...     END"""
>>> re.findall('START(.*?)END', text, re.S)
['\n    aba\n    asds\n    asdsa \n    ', '\n    fdfdfsds\n    ssdsds\n    sdsds\n    ', '\n    aba\n    asds\n    asdsa \n    ', '\n    fdfdfsds\n    ssdsds\n    sdsds\n    ']
>>> for i in re.findall('START(.*?)END', text, re.S): print(i)
... 

    aba
    asds
    asdsa 


    fdfdfsds
    ssdsds
    sdsds


    aba
    asds
    asdsa 


    fdfdfsds
    ssdsds
    sdsds
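
To see why both the flag and the non-greedy quantifier matter, the variants can be compared side by side. This is a rough sketch on a made-up two-block string, not the text from the question; the variable names are illustrative only.

```python
import re

# Hypothetical miniature input with two START...END blocks.
text = "START\na\nEND\nskip\nSTART\nb\nEND"

# Greedy ".*" with re.S runs from the first START to the last END.
greedy = re.findall(r"START(.*)END", text, re.S)

# Non-greedy ".*?" with re.S stops at the nearest END after each START.
lazy = re.findall(r"START(.*?)END", text, re.S)

# Without re.S, "." cannot cross the newline after START, so nothing matches.
no_flag = re.findall(r"START(.*?)END", text)

# "[\s\S]" matches any character, newlines included, so no flag is needed.
char_class = re.findall(r"START([\s\S]*?)END", text)

print(len(greedy), len(lazy), len(no_flag), len(char_class))
```

The `[\s\S]*?` form behaves like `.*?` with re.S, which may be handy when the flag cannot be passed.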
Irshad Bhat