Using Python 2.6.6.

I was hoping that the re module provided some method of searching that mimicked the way str.find() works, allowing you to specify a start index, but apparently not:
- search() lets me find the first match.
- findall() will return all (non-overlapping!) matches of a single pattern.
- finditer() is like findall(), but via an iterator (more efficient).
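To illustrate the non-overlapping behaviour, a tiny example (the pattern and input are made up):

```python
import re

# findall() returns non-overlapping matches only: once "12" is
# consumed, scanning resumes at "3", so "23" is never reported.
print(re.findall(r"\d\d", "12345"))  # ['12', '34']
```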
Here is the situation: I'm data mining huge blocks of data. For parts of the parsing, regex works great. But once I find certain matches, I need to switch to a different pattern, or even use more specialized parsing to find where to start searching next. If re.search allowed me to specify a starting index, it would be perfect. But in the absence of that, I'm looking at:
- Using finditer(), but skipping forward until I reach an index that is past where I want to resume using re. Potential problems:
  - The embedded binary data might happen to contain a match that overlaps a legitimate match just after the binary chunk.
  - Since I'm not searching for a single pattern, I'd have to juggle multiple iterators, which also raises the possibility of a false match hiding the real one.
- Slicing, i.e., creating a copy of the remainder of the data each time I want to search again.
  - This would be robust, but would force a lot of "needless" copying of data that could be many megabytes.
  - I'd prefer that all match locations stay indexes into the single original string object, since I may hang onto them for a while and want to compare them. Finding subsequent matches within separate sliced-off copies is a bookkeeping hassle.
- It just occurred to me that I might be able to use a "rotating buffer" sort of approach, but I haven't thought it through completely, and it might introduce a lot of complexity to the code.
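To make the first option concrete, here is a minimal sketch of the finditer() skip-forward approach; the pattern, data, and resume logic are invented for illustration, and the discarding in the loop is exactly where the overlap hazard lives:

```python
import re

data = "HDR list 12,34 BLOB \x00,\x99\x00 list 56,78"  # toy stand-in for a big buffer
pattern = re.compile(r"(\d+),(\d+)")  # hypothetical record pattern

def matches_from(pattern, data, start):
    # Skip finditer() results until one begins at or after `start`.
    # Hazard: a false match that *starts* before `start` but overlaps a
    # legitimate match just after the binary chunk would hide that match.
    for m in pattern.finditer(data):
        if m.start() >= start:
            yield m

resume_at = data.index("BLOB")  # pretend specialized parsing got us here
for m in matches_from(pattern, data, resume_at):
    print(m.start(), m.group())
```

The match objects still carry offsets into the original string, so positions remain comparable; the cost is that finditer() has already scanned (and matched inside) the region being skipped.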
Am I missing any obvious alternatives? I'm not sure whether there would be a way to wrap a huge string with a class that would serve slices, or some slicing sort of iterator or "string cursor" idiom?
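On the "string cursor" idea, one possible sketch relies on the fact that compiled pattern objects do accept a position argument (pattern.search(string, pos) searches from an offset without copying; it's only the module-level re.search() that lacks it). The class and names below are hypothetical:

```python
import re

class StringCursor(object):
    """Hypothetical cursor: walks one big string without slicing it.

    Match objects keep offsets relative to the original string, so
    positions from different patterns remain directly comparable.
    """
    def __init__(self, data):
        self.data = data
        self.pos = 0

    def search(self, pattern):
        # Compiled patterns take a pos argument, so nothing is copied.
        m = pattern.search(self.data, self.pos)
        if m:
            self.pos = m.end()
        return m

    def skip_to(self, index):
        # Lets specialized (non-regex) parsing advance the cursor.
        self.pos = index

data = "aaa KEY=1 junk junk KEY=2"
cur = StringCursor(data)
pat = re.compile(r"KEY=(\d)")
m1 = cur.search(pat)                           # finds KEY=1
cur.skip_to(data.index("junk", cur.pos) + 4)   # pretend other parsing moved us
m2 = cur.search(pat)                           # finds KEY=2
print(m1.start(), m2.start())
```

Different compiled patterns can be handed to the same cursor in turn, which sidesteps the multiple-iterator juggling described above.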