Why is this not a fixed width pattern?

Question

I'm trying to split English sentences correctly, and I came up with the unholy regex below:

(?<!\d|([A-Z]\.)|(\.[a-z]\.)|(\.\.\.)|etc\.|[Pp]rof\.|[Dd]r\.|[Mm]rs\.|[Mm]s\.|[Mm]z\.|[Mm]me\.)(?<=([\.!?])|(?<=([\.!?][\'\"])))[\s]+?(?=[\S])

The problem is, Python keeps raising the following error:


Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "sp.py", line 55, in analyze
    self.sentences = re.split(god_awful_regex, self.inputstr.strip())
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/re.py", line 165, in split
    return _compile(pattern, 0).split(string, maxsplit)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/re.py", line 243, in _compile
    raise error, v # invalid expression
sre_constants.error: look-behind requires fixed-width pattern

Why is this not a valid, fixed-width regex? I'm not using any repeat characters (* or +), just |.

EDIT @Anomie solved the problem - thanks a ton! Unfortunately, I cannot make the new expression balance:

(?<!(\d))(?<![A-Z]\.)(?<!\.[a-z]\.)(?<!(\.\.\.))(?<!etc\.)(?<![Pp]rof\.)(?<![Dd]r\.)(?<![Mm]rs\.)(?<![Mm]s\.)(?<![Mm]z\.)(?<![Mm]me\.)(?:(?<=[\.!?])|(?<=[\.!?][\'\"\]))[\s]+?(?=[\S])

is what I have now. The number of ('s matches the number of )'s, though:

>>> god_awful_regex = r'''(?<!(\d))(?<![A-Z]\.)(?<!\.[a-z]\.)(?<!(\.\.\.))(?<!etc\.)(?<![Pp]rof\.)(?<![Dd]r\.)(?<![Mm]rs\.)(?<![Mm]s\.)(?<![Mm]z\.)(?<![Mm]me\.)(?:(?<=[\.!?])|(?<=[\.!?][\'\"\]))[\s]+?(?=[\S])'''
>>> god_awful_regex.count('(')
17
>>> god_awful_regex.count(')')
17
>>> god_awful_regex.count('[')
13
>>> god_awful_regex.count(']')
13

Any more ideas?

I have no idea, but maybe because [Pp]rof = 4 chars while [Mm]rs = 3 chars? — orlp, Mar 16 '11 at 23:53
About the unbalanced parentheses: At a quick glance, the problem appears to be that near the end of your regex, you have mistakenly escaped the closing bracket of a character class, thereby making the closing parentheses part of the class instead of their actual function. You have escaped more than necessary in other cases, too. Try this: `r'''(?<!(\d))(?<![A-Z]\.)(?<!\.[a-z]\.)(?<!(\.\.\.))(?<!etc\.)(?<![Pp]rof\.)(?<![Dd]r\.)(?<![Mm]rs\.)(?<![Mm]s\.)(?<![Mm]z\.)(?<![Mm]me\.)(?:(?<=[.!?])|(?<=[.!?]['"]))[\s]+?(?=[\S])'''` — Tim Pietzcker, Mar 17 '11 at 07:35
Also, you might want to simplify your regex by making it case-insensitive (compile it with the `re.I` option). — Tim Pietzcker, Mar 17 '11 at 07:37
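
To make the comment about the escaped bracket concrete, here is a minimal sketch that uses only the tail of the regex. The names `broken_tail`, `fixed_tail`, and `compiles` are made up for this illustration. It assumes Python's built-in `re` module and shows why counting parentheses does not help: two of the `)` characters sit inside a character class, so they are literals rather than group closers.

```python
import re
import warnings

# Tail of the regex from the question's edit: "\]" is a literal "]", so the
# character class keeps going and swallows the two ")" that follow it.
broken_tail = r"""(?:(?<=[\.!?])|(?<=[\.!?][\'\"\]))[\s]+?(?=[\S])"""
# Same tail without the stray escape (and without the redundant ones).
fixed_tail = r"""(?:(?<=[.!?])|(?<=[.!?]['"]))\s+?(?=\S)"""


def compiles(pattern):
    """Return True if the re module accepts the pattern."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # hide "possible nested set" warnings
        try:
            re.compile(pattern)
            return True
        except re.error:
            return False


# A character count says "balanced" even though the pattern is not.
balanced_by_count = broken_tail.count("(") == broken_tail.count(")")
print(balanced_by_count, compiles(broken_tail), compiles(fixed_tail))

# The fixed tail splits after sentence punctuation, with or without a quote.
sentences = re.split(fixed_tail, 'He said "Stop!" Then he left. Done?')
print(sentences)
```

The sample sentence is only an illustration; the full regex with all the abbreviation look-behinds would be checked the same way.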

score 13 · Accepted Answer · answered Mar 17 '11 at 00:08

Consider this subexpression:

(?<=([\.!?])|(?<=([\.!?][\'\"])))

The left side of the | is one character wide, while the right side is zero-width (it is itself a look-behind). You have the same issue in your larger negative look-behind too: depending on the alternative, it could be 1, 2, 3, 4, or 5 characters.

Logically, a negative look-behind of (?<!A|B|C) should be equivalent to a series of look-behinds (?<!A)(?<!B)(?<!C). A positive look-behind of (?<=A|B|C) should be equivalent to (?:(?<=A)|(?<=B)|(?<=C)).
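
The rewrite described above can be sketched as follows. This is a cut-down illustration assuming Python's built-in `re` module, not the full regex from the question; the variable names and the sample strings are invented for the example.

```python
import re

# Alternatives of different widths ("etc." is 4 characters, "Dr." is 3)
# inside a single look-behind are rejected by the built-in re module.
try:
    re.compile(r"(?<!etc\.|[Dd]r\.)\s+")
    mixed_width_ok = True
except re.error:
    mixed_width_ok = False

# Negative case: (?<!A|B) rewritten as (?<!A)(?<!B). Each look-behind now
# has a single fixed width, so the pattern compiles.
chained = re.compile(r"(?<!etc\.)(?<![Dd]r\.)(?<=[.!?])\s+")
parts = chained.split("I saw Dr. Smith. He left, etc. and so on. Bye!")

# Positive case: (?<=A|B) rewritten as (?:(?<=A)|(?<=B)).
either = re.compile(r"""(?:(?<=[.!?])|(?<=[.!?]['"]))\s+""")
quoted = either.split('She said "Go." Then she left.')

print(mixed_width_ok)
print(parts)
print(quoted)
```

Note that "etc." at the true end of a sentence would also be skipped by this sketch; that is a limitation of the abbreviation list approach, not of the look-behind rewrite.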

miku · Answer 2 · 2011-03-17T00:10:36.517

This doesn't answer your question. However, if you want to split text into sentences, you might want to take a look at nltk, which includes, among many other things, a PunktSentenceTokenizer. Here is an example of the tokenizer in use:

""" PunktSentenceTokenizer

A sentence tokenizer which uses an unsupervised algorithm to build a model
for abbreviation words, collocations, and words that start sentences; and then
uses that model to find sentence boundaries. This approach has been shown to
work well for many European languages. """

from nltk.tokenize.punkt import PunktSentenceTokenizer

tokenizer = PunktSentenceTokenizer()
print(tokenizer.tokenize(__doc__))

# [' PunktSentenceTokenizer\n\nA sentence tokenizer which uses an unsupervised
# algorithm to build a model\nfor abbreviation words, collocations, and words
# that start sentences; and then\nuses that model to find sentence boundaries.',
# 'This approach has been shown to\nwork well for many European languages. ']

score -1 · Answer 3 · edited May 23 '17 at 10:27

It looks like you might be using the repeat characters near the end:

[\s]+?

Unless I'm reading that wrong.

UPDATE

Or it could be the vertical bar, as nightcracker mentioned. The first answer to this question seems to confirm it: "Determine if regular expression only matches fixed-length strings"

answered Mar 16 '11 at 23:53 by Chris Cherry (edited May 23 '17 at 10:27 by Community)

Yes, but since it's AFTER the lookbehind it shouldn't affect it. – orlp Mar 16 '11 at 23:53
As nightcracker said the "OR" vertical bar is allowing strings of different lengths to be matched, maybe that counts? – Chris Cherry Mar 17 '11 at 00:07
According to the first answer to this question: http://stackoverflow.com/questions/3627570/determine-if-regular-expression-only-matches-fixed-length-strings the vertical bar could be the culprit – Chris Cherry Mar 17 '11 at 00:09
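
The point made in these comments, that a quantifier after the look-behind is harmless while variable width inside it is not, can be checked with a small sketch. It assumes Python's built-in `re` module; the patterns are simplified stand-ins for the one in the question, and the names are hypothetical.

```python
import re

# Quantifier outside the look-behind: the look-behind itself is still
# exactly one character wide, so this compiles.
outside = re.compile(r"(?<=[.!?])\s+?(?=\S)")
pieces = outside.split("One. Two!  Three?")

# Quantifier inside the look-behind: the width is no longer fixed,
# so the re module refuses to compile it.
try:
    re.compile(r"(?<=[.!?]\s+)\S")
    inside_ok = True
except re.error:
    inside_ok = False

print(pieces)
print(inside_ok)
```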
