Performing incremental regex searches in huge strings (Python)

Question

Using Python 2.6.6.

I was hoping that the re module provided some method of searching that mimicked the way str.find() works, allowing you to specify a start index, but apparently not...

search() lets me find the first match...
findall() will return all (non-overlapping!) matches of a single pattern
finditer() is like findall(), but via an iterator (more efficient)

Here is the situation... I'm data mining in huge blocks of data. For parts of the parsing, regex works great. But once I find certain matches, I need to switch to a different pattern, or even use more specialized parsing to find where to start searching next. If re.search allowed me to specify a starting index, it would be perfect. But in absence of that, I'm looking at:

1. Using finditer(), but skipping forward until I reach an index that is past where I want to resume using re. Potential problems:
   - If the embedded binary data happens to contain a match that overlaps a legitimate match just after the binary chunk...
   - Since I'm not searching for a single pattern, I'd have to juggle multiple iterators, which also has the possibility of a false match hiding the real one.
2. Slicing, i.e., creating a copy of the remainder of the data each time I want to search again.
   - This would be robust, but would force a lot of "needless" copying on data that could be many megabytes.
   - I'd prefer to keep it so that all match locations were indexes into the single original string object, since I may hang onto them for a while and want to compare them. Finding subsequent matches within separate sliced-off copies is a bookkeeping hassle.
3. Just occurred to me that I may be able to use a "rotating buffer" sort of approach, but haven't thought it through completely. That might introduce a lot of complexity to the code.

Am I missing any obvious alternatives? Not sure if there would be a way to wrap a huge string with a class that would serve slices... Or a slicing sort of iterator or "string cursor" idiom?

possible duplicate of [Python: find regexp in a file](http://stackoverflow.com/questions/4989198/python-find-regexp-in-a-file) — Sean Vieira, Nov 28 '13 at 04:07
@SeanVieira: Yes, similarities but far from duplicate. mmap doesn't address the issue of incremental search, and the line-oriented search suggestion is even worse than slicing... — Agent Friday, Nov 28 '13 at 04:15
re.seaarch() has a start argument, pos - http://docs.python.org/2.6/library/re.html#re.RegexObject.search — wwii, Nov 28 '13 at 05:38
@AgentFriday - a memory mapped file can be sliced in multiple ways and all the slices are just references to the same underlying data (rather than duplicating strings in memory), thus removing one of the two objections you had to Option #2. :-) — Sean Vieira, Nov 28 '13 at 05:51
@wwii: correction: class re.RegexObject has a search( string, pos, endpos), not re.seaarch() [sic] -- This is awesome, I failed to dig deeper than the "convenience functions" in the re module. I think creating RegexObject instances is exactly what I needed. Care to write up as an answer? — Agent Friday, Nov 28 '13 at 22:28
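
For anyone skimming the comments, here is a minimal sketch of the compiled-pattern approach discussed above. The sample data and the two patterns are made up for illustration; only re.compile() and the search(string, pos, endpos) method of the compiled object come from the re documentation. Every match reports offsets into the one original string, so nothing is sliced or copied.

```python
import re

# Hypothetical sample data and patterns, for illustration only.
data = "hdr:AAA junk hdr:BBB rec:42 hdr:CCC"
header_re = re.compile(r"hdr:(\w+)")
record_re = re.compile(r"rec:(\d+)")

# The first search starts at the beginning of the string.
first = header_re.search(data)

# Resume just past the previous match by passing its end offset as pos.
second = header_re.search(data, first.end())

# Switch to a different pattern and keep going from the same position.
record = record_re.search(data, second.end())
third = header_re.search(data, record.end())

# endpos bounds the search on the right: look only between the two headers.
bounded = header_re.search(data, first.end(), second.start())

# All spans are indexes into the original string, so they compare directly.
spans = [m.span() for m in (first, second, record, third)]
```

One caveat from the docs: searching with pos is not completely equivalent to slicing, because '^' still matches at the real beginning of the string (or after a newline in MULTILINE mode), not at pos.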

score 4 · Answer 1 · answered Nov 28 '13 at 04:02

Use a two-pass approach. The first pass uses the first regex to find the "interesting bits" and outputs those offsets into a separate file. You didn't say if you can tell where the "end" of each interesting segment is, but you'd include that too if available. The second pass uses the offsets to load sections of the file as independent strings and then applies whatever secondary regex you like on each smaller string.
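
A rough sketch of the two passes, under some assumptions: the BEGIN/END and id= patterns, the sample bytes, and the function names first_pass and second_pass are all hypothetical, and the offsets are kept in a list rather than written to a separate file to keep the example short. The span() of each match object supplies both the start and the end offset.

```python
import os
import re
import tempfile

# Hypothetical patterns: an "interesting" segment and a detail inside it.
primary_re = re.compile(br"BEGIN(.*?)END", re.DOTALL)
secondary_re = re.compile(br"id=(\d+)")

def first_pass(data):
    """Pass 1: record (start, end) offsets of each interesting segment."""
    return [m.span() for m in primary_re.finditer(data)]

def second_pass(path, offsets):
    """Pass 2: load each segment on its own and apply the secondary regex."""
    results = []
    with open(path, "rb") as f:
        for start, end in offsets:
            f.seek(start)
            segment = f.read(end - start)
            for m in secondary_re.finditer(segment):
                # Add the segment start to get an offset into the whole file.
                results.append((start + m.start(), m.group(1)))
    return results

# Small made-up input written to a temporary file for the demonstration.
sample = b"xxBEGIN id=7 ENDyyyBEGIN id=19 id=23 ENDzz"
fd, path = tempfile.mkstemp()
try:
    os.write(fd, sample)
    os.close(fd)
    offsets = first_pass(sample)
    hits = second_pass(path, offsets)
finally:
    os.remove(path)
```

Here pass 1 runs over the bytes already in memory; for a really large file it could run over a memory-mapped view instead, as suggested in the comments on the question.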

David Pope

How would you return the offset of the "interesting bit found"? – Floris Nov 28 '13 at 04:05
@Floris - Wouldn't you use the start or span attributes of the match objects? – wwii Nov 28 '13 at 05:40
@wwii - you may be right; I thought the answer deserved to be expanded with that information as I don't know what the answer is but it matters... – Floris Nov 28 '13 at 05:45
Similarly, you could write a generator to *feed* the interesting pieces to a processor. – wwii Nov 28 '13 at 05:54
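
The generator idea from the last comment might look something like this. It is only a sketch: the bracket and k= patterns, the sample text, and the names interesting_spans and process are invented for the example. The secondary pattern is applied with finditer(string, pos, endpos), so the reported positions are still offsets into the original string.

```python
import re

def interesting_spans(pattern, text):
    """Lazily yield the (start, end) span of each interesting piece."""
    for m in pattern.finditer(text):
        yield m.span()

def process(spans, secondary, text):
    """Apply a secondary compiled pattern inside each span, without slicing."""
    for start, end in spans:
        for m in secondary.finditer(text, start, end):
            yield m.start(), m.group(0)

# Hypothetical data: only k=... entries inside square brackets are wanted.
text = "aa [k=1 k=22] bb k=999 [k=3] cc"
block_re = re.compile(r"\[[^\]]*\]")
key_re = re.compile(r"k=(\d+)")

found = list(process(interesting_spans(block_re, text), key_re, text))
```

Because both stages are generators, nothing is searched until the consumer asks for the next item, which keeps memory use flat even when there are many pieces.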
