
I'm looking to build a function that extracts the string contents between two markers and returns a list of the extracted strings:

def extract(raw_string, start_marker, end_marker):
    ... function ...
    return extraction_list

I know this can be done with a regex, but is that fast? This function will be called billions of times in my process. What is the fastest way to do this?

What happens if the markers are the same and appear an odd number of times?

The function should return multiple strings if the start and end markers appear more than once.

Matt Alcock

1 Answer


You probably can't go faster than:

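def extract(raw_string, start_marker, end_marker):
    # Returns only the first match; str.index raises ValueError if a marker is missing.
    start = raw_string.index(start_marker) + len(start_marker)
    # Search for the end marker only after the start marker.
    end = raw_string.index(end_marker, start)
    return raw_string[start:end]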

But if you want to try a regex, benchmark it. The standard library's timeit module works well for that.
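A rough sketch of such a benchmark, in case it helps. The names `extract_index` and `make_regex_extractor`, the sample string, and the `<b>`/`</b>` markers are all made up for illustration; the regex is precompiled and the markers are escaped with `re.escape`. The timings depend heavily on your real inputs, so treat this as a template rather than a result.

```python
import re
import timeit


def extract_index(raw_string, start_marker, end_marker):
    # Same approach as the answer above: two str.index calls and a slice.
    start = raw_string.index(start_marker) + len(start_marker)
    end = raw_string.index(end_marker, start)
    return raw_string[start:end]


def make_regex_extractor(start_marker, end_marker):
    # Hypothetical helper: compile the pattern once, outside the hot loop.
    pattern = re.compile(
        re.escape(start_marker) + r"(.*?)" + re.escape(end_marker), re.DOTALL
    )

    def extract_regex(raw_string):
        match = pattern.search(raw_string)
        if match is None:
            raise ValueError("markers not found")
        return match.group(1)

    return extract_regex


# Illustrative input only; benchmark with strings that resemble your real data.
sample = "prefix <b>payload</b> suffix"
extract_regex = make_regex_extractor("<b>", "</b>")

index_time = timeit.timeit(
    lambda: extract_index(sample, "<b>", "</b>"), number=100000
)
regex_time = timeit.timeit(lambda: extract_regex(sample), number=100000)
print("index: %.3fs  regex: %.3fs" % (index_time, regex_time))
```

Both functions should return the same substring for the same input, so the only thing being compared is speed.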

viraptor
  • Agreed. If your regex is precompiled it might not be slower than this, but using @viraptor's solution avoids any regex overhead that might occur. I'm not sure if Python's re has that or not, but this is also easier to read and maintain. – andronikus Oct 06 '11 at 18:44
  • Thanks @viraptor, I like this use of index and the fact that you've accounted for markers of more than a single character. What happens if the start and end markers appear more than once? E.g. multiple ' quotes for names. You'd want to return a list of the items in the quotes. – Matt Alcock Oct 13 '11 at 12:02
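One possible way to handle repeated markers is to loop with `str.find` and collect each match into a list. This is only a sketch, not the answerer's code: `extract_all` is a hypothetical name, and the choice to silently drop an unmatched trailing start marker (which is what happens when identical markers appear an odd number of times) is an assumption you may want to change.

```python
def extract_all(raw_string, start_marker, end_marker):
    # Hypothetical variant of extract() that returns every match as a list.
    if not start_marker or not end_marker:
        raise ValueError("markers must be non-empty")
    results = []
    pos = 0
    while True:
        start = raw_string.find(start_marker, pos)
        if start == -1:
            break  # no further start marker
        start += len(start_marker)
        end = raw_string.find(end_marker, start)
        if end == -1:
            break  # assumption: an unmatched start marker is ignored
        results.append(raw_string[start:end])
        # Resume searching after the end marker so it is not reused as a start.
        pos = end + len(end_marker)
    return results


print(extract_all("'Ann' and 'Bob'", "'", "'"))  # ['Ann', 'Bob']
print(extract_all("'Ann' and 'Bob", "'", "'"))   # ['Ann']
```

Because the search resumes after each end marker, identical start and end markers are consumed in pairs, so quoted names come back one per pair of quotes.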