I'm trying to split English sentences correctly, and I came up with the unholy regex below:
(?<!\d|([A-Z]\.)|(\.[a-z]\.)|(\.\.\.)|etc\.|[Pp]rof\.|[Dd]r\.|[Mm]rs\.|[Mm]s\.|[Mm]z\.|[Mm]me\.)(?<=([\.!?])|(?<=([\.!?][\'\"])))[\s]+?(?=[\S])'
The problem is, Python keeps raising the following error:
Traceback (most recent call last):
File "", line 1, in
File "sp.py", line 55, in analyze
self.sentences = re.split(god_awful_regex, self.inputstr.strip())
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/re.py", line 165, in split
return _compile(pattern, 0).split(string, maxsplit)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/re.py", line 243, in _compile
raise error, v # invalid expression
sre_constants.error: look-behind requires fixed-width pattern
Why is this not a valid, fixed-width regex? I'm not using any repeat characters (* or +), just |.
EDIT @Anomie solved the problem - thanks a ton! Unfortunately, I cannot make the new expression balance:
(?<!(\d))(?<![A-Z]\.)(?<!\.[a-z]\.)(?<!(\.\.\.))(?<!etc\.)(?<![Pp]rof\.)(?<![Dd]r\.)(?<![Mm]rs\.)(?<![Mm]s\.)(?<![Mm]z\.)(?<![Mm]me\.)(?:(?<=[\.!?])|(?<=[\.!?][\'\"\]))[\s]+?(?=[\S])
is what I have now. The number of ('s matches the number of ('s, though:
>>> god_awful_regex = r'''(?<!(\d))(?<![A-Z]\.)(?<!\.[a-z]\.)(?<!(\.\.\.))(?<!etc\.)(?<![Pp]rof\.)(?<![Dd]r\.)(?<![Mm]rs\.)(?<![Mm]s\.)(?<![Mm]z\.)(?<![Mm]me\.)(?:(?<=[\.!?])|(?<=[\.!?][\'\"\]))[\s]+?(?=[\S])'''
>>> god_awful_regex.count('(')
17
>>> god_awful_regex.count(')')
17
>>> god_awful_regex.count('[')
13
>>> god_awful_regex.count(']')
13
Any more ideas?