This is a continuation of the question "Extract all substrings between two markers". The answers by @Daweo and @Tim Biegeleisen work for small strings.

But for very large strings, regular expressions don't seem to work. This could be because of a limit on string length, as seen below:

>>> import re
>>> teststr = "&marker1\nThe String that I want /\n&marker1\nAnother string that I want /\n"
>>> for i in range(0, 23):
...    teststr += teststr # creating a very long string here
... 
>>> len(teststr)
603979776
>>> found = re.findall(r"\&marker1\n(.*?)/\n", newstr)
>>> len(found)
46
>>> found
['The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ']

What can I do to resolve this and find all occurrences between the markers start="&marker1" and end="/\n"? What is the maximum string length that re can handle?

newkid

1 Answer

I couldn't get re.findall to work. I now use re only to find the locations of the markers, and then extract the substrings manually by slicing:

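import re

# Index just past each start marker, and index where each end marker begins
locs_start = [match.end() for match in re.finditer(r"&marker1\n", mylongstring)]
locs_end = [match.start() for match in re.finditer(r"/\n", mylongstring)]

# Pair the i-th start with the i-th end; this assumes the markers strictly alternate
substrings = []
for start, end in zip(locs_start, locs_end):
    substrings.append(mylongstring[start:end])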
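
The index pairing above relies on every start marker having exactly one end marker after it. If that is not guaranteed (for example, if "/\n" can also occur outside a marked block), a possible variation is to search for each end marker only after its start marker. The following is a minimal sketch using plain str.find instead of re. The helper name extract_between is hypothetical and not part of the original answer. It yields results lazily, so it never holds two full index lists in memory.

```python
def extract_between(text, start_marker, end_marker):
    """Yield each substring between start_marker and the next end_marker."""
    pos = 0
    while True:
        start = text.find(start_marker, pos)
        if start == -1:
            return  # no further start marker
        start += len(start_marker)  # skip past the marker itself
        end = text.find(end_marker, start)
        if end == -1:
            return  # unterminated block: stop rather than guess
        yield text[start:end]
        pos = end + len(end_marker)  # resume the search after this block


sample = "&marker1\nThe String that I want /\n&marker1\nAnother string that I want /\n"
print(list(extract_between(sample, "&marker1\n", "/\n")))
# ['The String that I want ', 'Another string that I want ']
```

For a very long input, iterating over extract_between(mylongstring, "&marker1\n", "/\n") in a for loop processes one substring at a time, whereas wrapping it in list() collects all of them at once.
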
newkid