I'm trying to parse repeating blocks of text in Python. Each block begins with '----BEGIN---' and ends with '---END', so the text file will look like the sample below. I want to find each block (words, numbers, and special characters) and parse it for further analysis. The code below is as close as I have gotten, but it returns the entire document, not each block. Any help would be appreciated.

import re

block_search = re.compile('----BEGIN---.*---END', re.DOTALL)
with open(file, 'r', encoding='utf-8') as f:
    text = f.read()
    result = re.findall(block_search, text)

----BEGIN--- Words Special Character Numbers words Special character words numbers words words. words numbers words Special character words numbers words words words numbers words words ---END

----BEGIN--- Words words numbers words Special character words numbers words words. words numbers words Special character words numbers words words words numbers words words ... ---END

Clovis

1 Answer

'----BEGIN---.*---END' will match everything from the first occurrence of ----BEGIN--- to the last occurrence of ---END, because .* is greedy: it consumes as much text as it can. If you want to match each block separately, use the non-greedy .*? instead. It consumes as little text as possible, so each match stops at the first ---END that follows its ----BEGIN---.

block_search = re.compile('----BEGIN---.*?---END',re.DOTALL)
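
To see the difference, here is a minimal sketch that runs both patterns on a small in-memory sample. The sample string and the variable names (sample, greedy, lazy, blocks) are made up for illustration and are not from the question's file. The capture group around .*? is an optional addition so that findall returns only the text between the markers.

```python
import re

# Hypothetical sample standing in for the contents of the text file.
sample = (
    "----BEGIN--- first block 123 !@# ---END\n"
    "\n"
    "----BEGIN--- second block\nspanning two lines ---END\n"
)

# Greedy: .* runs from the first ----BEGIN--- to the last ---END.
greedy = re.compile(r"----BEGIN---.*---END", re.DOTALL)

# Non-greedy: .*? stops at the first ---END after each ----BEGIN---.
# The parentheses make findall return only the text between the markers.
lazy = re.compile(r"----BEGIN---(.*?)---END", re.DOTALL)

greedy_matches = greedy.findall(sample)
blocks = [body.strip() for body in lazy.findall(sample)]

print(len(greedy_matches))  # one match covering both blocks
for block in blocks:
    print(repr(block))
```

re.DOTALL is still needed in both cases so that . also matches the newlines inside a block.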
ThePyGuy
  • That got me 90% of the way there. What I don't understand now is that with re.findall() it does not find every instance of the block. – Clovis Jul 22 '21 at 20:42
  • Yeah, you were missing `?` only. For the sample data you have, it is finding both the occurrences. – ThePyGuy Jul 22 '21 at 20:44
  • No. I understood you there. There was a different problem with my code that prevented it from finding the following iterations of the blocks. Thanks for the help! – Clovis Jul 22 '21 at 20:48