How do I build a tokenizing regex-based iterator in Python?

Question

I'm basing this question on an answer I gave to this other SO question, which was my specific attempt at a tokenizing regex based iterator using more_itertools's pairwise iterator recipe.

Following is my code taken from that answer:

from more_itertools import pairwise
import re

string = "dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d"
# split according to the given delimiter including segments beginning at the beginning and ending at the end
for prev, curr in pairwise(re.finditer(r"^|[ ]+|$", string)):
    print(string[prev.end(): curr.start()])  # originally I yield here

I then noticed that if the string starts or ends with delimiters (e.g. string = " dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d "), the tokenizer prints empty strings at the beginning and end of its token output (these come from the extra matches for the string start and the string end). To remedy this I tried the following (quite ugly) alternative regexes:

"(?:^|[ ]|$)+" - This seems simple and looks like it should work, but it doesn't (and it also behaves very differently on other regex engines). For some reason it doesn't build a single match from the string start plus the delimiters following it; instead, the string start somehow also consumes the character following it! This is also where I see the divergence from other engines. Is this a bug, or does it have something to do with zero-width anchors and the alternation (|) operator in Python that I'm not aware of? This attempt also did nothing for the double match at the string's end: it matched the delimiters once and then gave another match for the string end ($) by itself.
"(?:[ ]|$|^)+" - Putting the delimiters first actually solves one of the problems: the split at the beginning no longer contains the string start (though I don't care too much about that, since I'm interested in the tokens themselves). It also matches the string start when there are no delimiters at the beginning of the string, but the string ending is still a problem.
"(^[ ]*)|([ ]*$)|([ ]+)" - This final attempt got the string start to be part of the first match (which wasn't much of a problem in the first place), but try as I might I couldn't get rid of the "delimiters + end, then end alone" double match (which yields an additional empty string). Still, I'm showing this example (with grouping) because it shows that the end anchor $ is matched twice, once with the preceding delimiters and once by itself (two matches of group 2).

My questions are:

Why do I get such strange behavior in attempt #1?
How do I solve the end-of-string issue?
Am I being a tank, i.e. is there a simple way to solve this that I'm blindly missing?
Remember that the solution can't change the string and must produce an iterable generator that iterates over the spaces between the tokens and not over the tokens themselves. (This last part might seem to complicate the answer unnecessarily, since otherwise I have a simple answer. If you must know, and if you don't, read no further: it's part of a bigger framework I'm building, where this yielding method is inherited by a pipeline that constructs yielded sentences out of it in various patterns, which are used to extract fields from semi-structured, classifier-driven messages.)

Just an idea - make sure your regex is correct, I see you do not use raw string for it, so some escaping might be messed up. Try changing `"^|[{0}]+|$".format(delimiters)` to `r"^|[{0}]+|$".format(delimiters)` — CrowbarKZ, Jan 22 '18 at 19:35
I tried it, thanks for the offer but it doesn't help, also calculated it on regex101.com which is where I noticed the difference between python and other languages when dealing with the caret '^' string start character. — Veltzer Doron, Jan 22 '18 at 19:38
@user2357112 It's hard to see: its last printout is an empty string, so it should print an end-of-line. Try it here (https://regex101.com/) then, and be sure to use the 3rd regex engine (python). — Veltzer Doron, Jan 22 '18 at 19:46
@user2357112 I think I'm gonna print the regexes' parse trees tomorrow. — Veltzer Doron, Jan 22 '18 at 20:02

user2357112 · Accepted Answer · 2018-01-22T20:52:35.567

The problems you're having are due to the trickiness and undocumented edge cases of zero-width matches. You can resolve them by using negative lookarounds to explicitly tell Python not to produce a match for ^ or $ if the string has delimiters at the start or end:

import re

delimiter_re = r'[\n\- ]'     # newline, hyphen, or space
search_regex = r'''^(?!{0})   # string start with no delimiter
                   |          # or
                   {0}+       # sequence of delimiters (at least one)
                   |          # or
                   (?<!{0})$  # string end with no delimiter
                '''.format(delimiter_re)
search_pattern = re.compile(search_regex, re.VERBOSE)

Note that this will produce one match in an empty string, not zero, and not separate beginning and ending matches.
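As a rough sketch of how the lookaround pattern above might be wired into a yielding generator: the function name `iter_between_delimiters` and the local `pairwise` helper are assumptions made for illustration and are not part of the answer (`more_itertools.pairwise` from the question would do the same job).

```python
import re
from itertools import tee

DELIMITER_RE = r'[\n\- ]'  # newline, hyphen, or space (same class as above)
SEARCH_PATTERN = re.compile(
    r'''^(?!{0})    # string start with no delimiter
      | {0}+        # sequence of delimiters (at least one)
      | (?<!{0})$   # string end with no delimiter
    '''.format(DELIMITER_RE),
    re.VERBOSE,
)


def pairwise(iterable):
    """Hypothetical stand-in for more_itertools.pairwise: s -> (s0, s1), (s1, s2), ..."""
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)


def iter_between_delimiters(string):
    """Yield the substrings lying between consecutive separator matches."""
    for prev, curr in pairwise(SEARCH_PATTERN.finditer(string)):
        yield string[prev.end():curr.start()]


print(list(iter_between_delimiters(" dasdha hasud-hasuid\n")))
# ['dasdha', 'hasud', 'hasuid']
```

With leading or trailing delimiters the anchors no longer produce matches of their own, so no empty strings are yielded; an empty input gives a single match and therefore no pairs at all.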

It may be simpler to iterate over non-delimiter sequences and use the resulting matches to locate the string components you want:

token = re.compile(r'[^\n\- ]+')
previous_end = 0
for match in token.finditer(string):
    do_something_with(string[previous_end:match.start()])
    previous_end = match.end()
do_something_with(string[previous_end:])
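Since the question asks for a generator, the loop above could be wrapped roughly like this. This is only a sketch: the name `iter_gaps` is made up here, and it yields the stretches between tokens (including a possibly empty leading and trailing stretch) instead of calling `do_something_with`.

```python
import re

TOKEN = re.compile(r'[^\n\- ]+')  # one or more non-delimiter characters


def iter_gaps(string):
    """Yield the text between consecutive tokens, plus the leading and trailing parts."""
    previous_end = 0
    for match in TOKEN.finditer(string):
        yield string[previous_end:match.start()]
        previous_end = match.end()
    # Whatever follows the last token (empty if the string ends with a token).
    yield string[previous_end:]


print(list(iter_gaps(" ab cd-ef")))
# [' ', ' ', '-', '']
```

Because only non-empty token matches are searched for, none of the zero-width edge cases come into play here.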

The extra matches you were getting at the end of the string were because after matching the sequence of delimiters at the end, the regex engine looks for matches at the end again, and finds a zero-width match for $.
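A small check that makes the double match visible, using a shortened pattern and a made-up input string (both are illustrative assumptions, not the question's exact data):

```python
import re

# Trailing delimiter: the run of spaces matches first, then $ matches
# again as a zero-width match at the very end of the string.
spans = [m.span() for m in re.finditer(r"[ ]+|$", "ab ")]
print(spans)
# [(2, 3), (3, 3)]
```

The second span has equal start and end, which is the zero-width `$` match that shows up as an extra empty string when slicing between matches.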

The behavior you were getting at the beginning of the string for the ^|... pattern is trickier: the regex engine sees a zero-width match for ^ at the start of the string and emits it, without trying the other | alternatives. After the zero-width match, the engine needs to avoid producing that match again to avoid an infinite loop; this particular engine appears to do that by skipping a character, but the details are undocumented and the source is hard to navigate. (Here's part of the source, if you want to read it.)
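The start-of-string behaviour can be inspected the same way. Note that the `re` documentation records a change in Python 3.7 (non-empty matches can now start just after a previous empty match), so the skipped character described above is what older interpreters do; the spans in the comment below assume Python 3.7 or later, and the input string is made up for illustration.

```python
import re

# Leading delimiter: ^ produces a zero-width match at position 0 first.
spans = [m.span() for m in re.finditer(r"^|[ ]+", " ab")]
print(spans)
# On Python 3.7+: [(0, 0), (0, 1)]
# Older versions are expected to differ after the first, zero-width match.
```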

The behavior you were getting at the start of the string for the (?:^|...)+ pattern is even trickier. Executing this straightforwardly, the engine would look for a match for (?:^|...) at the start of the string, find ^, then look for another match, find ^ again, then look for another match ad infinitum. There's some undocumented handling that stops it from going on forever, and this handling appears to produce a zero-width match, but I don't know what that handling is.

Very thorough answer. I guess the lookahead is unavoidable here. I'll examine the behaviours of $ and ^ further, as they are very counterintuitive, but for a solution I'll build a yielding generator according to your second suggestion. — Veltzer Doron, Jan 23 '18 at 06:01

xgord · Answer 2 · 2018-01-22T19:39:48.027


It sounds like you're just trying to return a list of all the "words" separated by any number of delimiting characters. You could instead use a regex group with a negated character class ([^...]) to achieve this:

import re

# match any number of consecutive non-delimiter chars
string = "  dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d  "
delimiters = r'\n\- '  # raw string, so the regex engine (not Python) interprets \n and \-
regex = r'([^{0}]+)'.format(delimiters)
for match in re.finditer(regex, string):
    print(match.group(0))

output:

dasdha
hasud
hasuid
hsuia
dhsuai
dhasiu
dhaui
d

edited Jan 22 '18 at 19:39

answered Jan 22 '18 at 19:27


I need it to be an iterable (yielding) generator since it's part of a much bigger framework – Veltzer Doron Jan 22 '18 at 19:30
@VeltzerDoron sorry missed that part of your question. I changed my answer to use finditer, as your example does. does this fit the bill? – xgord Jan 22 '18 at 19:40
I voted you up for the effort but I need the spaces between the tokens yielded for reasons I elaborated on in the question body – Veltzer Doron Jan 22 '18 at 20:34

@VeltzerDoron To me, your edits made it even more confusing what you're trying to find. Could you edit your question to include: given your sample string with the leading and trailing spaces, what exactly is the output you want the generator to be iterating over? ex. if your goal isn't to iterate over `['dasdha', 'hasud', 'hasuid', 'hsuia', 'dhsuai', 'dhasiu', 'dhaui', 'd']` what is it you *are* trying to iterate over? – xgord Jan 22 '18 at 20:43
Basically I found a solution (imperfect one but nu shoyn), but the question can be narrowed down to, find a simple regex that matches delimiters and start and end of string. – Veltzer Doron Jan 23 '18 at 10:28
