5

I want to search for a regex match in a larger string from a certain position onwards, and without using string slices.

My background is that I want to search through a string iteratively for matches of various regexes. A natural solution in Python would be to keep track of the current position within the string and use e.g.

re.match(regex, largeString[pos:])

in a loop. But for really large strings (~1 MB), slicing as in largeString[pos:] becomes expensive, because every slice copies the rest of the string. I'm looking for a way to get around that.

Side note: Funnily, a corner of the Python documentation mentions an optional pos parameter to the match function (which would be exactly what I want), but it is nowhere to be found in the description of the module-level functions themselves :-).

ThomasH

4 Answers

6

The variants with pos and endpos parameters only exist as members of regular expression objects. Try this:

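import re
pattern = re.compile("match here")
text = "don't match here, but do match here"  # renamed: "input" shadows a built-in
start = text.find(",")
# search() only reports matches that begin at index start or later
print(pattern.search(text, start).span())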

... outputs (25, 35)
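
Applied to the iterative use case from the question, the same idea might look like the sketch below. The token names and patterns are hypothetical, chosen only for illustration; the point is that pattern.match(text, pos) is anchored at pos and m.end() is an index into the full string, so no slice is ever created.

```python
import re

# Hypothetical token patterns, chosen only for illustration.
TOKEN_PATTERNS = [
    ("NUMBER", re.compile(r"\d+")),
    ("NAME", re.compile(r"[A-Za-z_]\w*")),
    ("SPACE", re.compile(r"\s+")),
]


def tokenize(text):
    """Yield (kind, lexeme) pairs, advancing pos instead of slicing text."""
    pos = 0
    while pos < len(text):
        for kind, pattern in TOKEN_PATTERNS:
            m = pattern.match(text, pos)  # anchored at pos, no copy of text
            if m:
                yield kind, m.group(0)
                pos = m.end()  # end() is an index into the full string
                break
        else:
            # None of the patterns matched at the current position.
            raise ValueError(f"no pattern matches at position {pos}")


print(list(tokenize("abc 42 x1")))
```

Every pattern here matches at least one character, so pos always advances; a pattern that can match the empty string would need extra care to avoid looping forever.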

Martin Stone
  • This is crazy! The ``pos`` param is actually there, but only with the object methods! I must have been blind ... Thanks a lot, also to the other guys. – ThomasH Jun 09 '11 at 10:17
4

The pos keyword is only available in the method versions. For example,

re.match("e+", "eee3", pos=1)

is invalid (it raises a TypeError, because the module-level function takes no pos argument), but

pattern = re.compile("e+")
pattern.match("eee3", pos=1)

works.
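
One caveat worth checking for yourself: starting at pos is not quite the same as slicing. A rough sketch (the "^ab" pattern and the "xab" string are made up for the demonstration) shows both the TypeError from the module-level function and the fact that "^" still refers to the real start of the string when pos is used:

```python
import re

pattern = re.compile("e+")

# The module-level function accepts only (pattern, string, flags).
try:
    re.match("e+", "eee3", pos=1)
except TypeError as exc:
    print("TypeError:", exc)

# The method form starts matching at index 1; the span indexes the full string.
print(pattern.match("eee3", pos=1).span())

# pos is not equivalent to slicing: "^" still means the real start of the string.
anchored = re.compile("^ab")
print(anchored.search("xab", 1))    # no match: index 1 is not the string start
print(anchored.search("xab"[1:]))   # matches: the slice is a new string
```

So if the patterns rely on "^" to mean "at the current position", pattern.match(text, pos) is the safer replacement for the slice, since match() is always anchored at pos.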

Jeremy
  • ... and I was so sure that the only difference between the module function and the object method was the regex parameter (and maybe the flags) :-/ . Blame me. – ThomasH Jun 09 '11 at 10:28
2
>>> import re
>>> m = re.compile("(o+)")
>>> m.match("oooo").span()
(0, 4)
>>> m.match("oooo", 2).span()
(2, 4)
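
The pattern methods also take a third positional argument, endpos, which limits how far the match may extend. A small sketch reusing the same pattern (the index values are picked arbitrarily):

```python
import re

m = re.compile("(o+)")

# pos=1, endpos=3: only the characters at indices 1 and 2 are considered.
print(m.match("oooo", 1, 3).span())

# Starting at the very end leaves nothing for "o+" to match.
print(m.match("oooo", 4))
```

The first call prints (1, 3) and the second prints None.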
Timofey Stolbov
1

You could also use positive lookbehinds, like so:

import re

test_string = "abcabdabe"

position = 3
# the lookbehind requires exactly `position` arbitrary characters before the match
a = re.search("(?<=.{" + str(position) + "})ab[a-z]", test_string)

print(a.group(0))

yields:

abd
MGwynne
  • Thanks for the idea, but for long input strings, if I'm searching towards the end of that string, this would make a verrryyyy long look-behind :). But I'll keep it for later. – ThomasH Jun 09 '11 at 10:22