0

I'm trying to split a paragraph into sentences with a regex split, using the second answer posted here: a Regex for extracting sentence from a paragraph in python

However, I have a list of abbreviations that should not end a sentence even though they contain a period, and I don't know how to add them to that regular expression properly. I'm reading the abbreviations from a file that contains terms like Mr. Ms. Dr. St. (one on each line).

user2017502

2 Answers

1

Short answer: You can't do it with a single lookbehind assertion unless all of its alternatives have the same, fixed width (which they probably don't in your case; your example contained only two-letter abbreviations, but Mrs. would break your regex).

This is a limitation of the current Python regex engine.

Longer answer:

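You could write a regex like (?s)(?<!.Mr|Mrs|.Ms|.St)\., padding each alternative in the lookbehind assertion with as many .s as needed to bring them all to the same width. However, that fails in some circumstances, for example when a paragraph begins with "Mr.": there is no character before "Mr" for the padding . to match, so the lookbehind does not recognize the abbreviation and the period is treated as a sentence end.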
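To make that failure concrete, here is a small sketch using the padded pattern from above. The helper name `period_positions` and the sample strings are made up for illustration; it only reports where the pattern finds a "sentence-ending" period.

```python
import re

# Every alternative is padded to exactly three characters so that
# Python's re module accepts the lookbehind.
padded = re.compile(r"(?s)(?<!.Mr|Mrs|.Ms|.St)\.")

def period_positions(text):
    """Return the indexes of the periods the pattern treats as sentence ends."""
    return [m.start() for m in padded.finditer(text)]

# "Mr." in the middle of the text: the padding dot matches the space before it.
mid_paragraph = period_positions("Hi Mr. Smith left.")

# "Mr." at the very start: nothing precedes "Mr" for the padding dot to match.
at_start = period_positions("Mr. Smith left.")

print(mid_paragraph)  # only the final period is reported
print(at_start)       # the period after "Mr" is reported as well
```

In the second call the abbreviation is not protected, which is the edge case described above.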

Anyway, you're not using the right tool here. Better use a tool designed for the job, for example the Natural Language Toolkit.

If you're stuck with regex (too bad!), then you could try and use a findall() approach instead of split():

(?:(?:\b(?:Mr|Ms|Dr|Mrs|St)\.)|[^.])+\.\s*

would match a sentence that ends in . (optionally followed by whitespace) and contains no other dots except those that directly follow one of the allowed abbreviations.

>>> import re
>>> s = "My name is Mr. T. I pity the fool who's not on the A-Team."
>>> re.findall(r"(?:(?:\b(?:Mr|Ms|Dr|Mrs|St)\.)|[^.])+\.\s*", s)
['My name is Mr. T. ', "I pity the fool who's not on the A-Team."]
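Since the abbreviations come from a file, the alternation can be generated instead of typed by hand. The following is only a sketch under a few assumptions: the list `abbreviations` stands in for the lines read from that file, each entry ends in a period, and the function name `build_sentence_pattern` is invented here.

```python
import re

def build_sentence_pattern(abbreviations):
    """Build the findall() pattern from entries such as 'Mr.' (one per line)."""
    # Drop blank lines and the trailing period, which the pattern adds itself.
    stems = {a.strip().rstrip(".") for a in abbreviations if a.strip()}
    # Escape each stem so regex metacharacters in the file are taken literally.
    alternation = "|".join(re.escape(s) for s in sorted(stems, key=len, reverse=True))
    return re.compile(r"(?:\b(?:%s)\.|[^.])+\.\s*" % alternation)

# Stand-in for something like: open(path).read().splitlines()
abbreviations = ["Mr.", "Ms.", "Dr.", "Mrs.", "St."]
pattern = build_sentence_pattern(abbreviations)

text = "Dr. Smith met Mrs. Jones on Baker St. yesterday. They talked."
sentences = pattern.findall(text)
print(sentences)
```

Like the hand-written version, this only recognizes sentences that end in a period, and trailing whitespace stays attached to each match.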
Tim Pietzcker
  • I can only use the Python stdlib. Is there any other way to take the abbreviations into consideration without using lookbehinds then? – user2017502 Jan 28 '13 at 09:24
  • *“This is a limitation of the current Python regex engine.”* – Do you have something for me to read about this limitation? – poke Jan 28 '13 at 10:10
  • @poke: The [docs](http://docs.python.org/2/library/re.html#regular-expression-syntax) read: "`(?<=...)`: Matches if the current position in the string is preceded by a match for ... that ends at the current position. This is called a positive lookbehind assertion. `(?<=abc)def` will find a match in `abcdef`, since the lookbehind will back up 3 characters and check if the contained pattern matches. The contained pattern must only match strings of some fixed length, meaning that `abc` or `a|b` are allowed, but `a*` and `a{3,4}` are not. " – Tim Pietzcker Jan 28 '13 at 10:59
  • @TimPietzcker: Look-behind for a list of fixed text can be done (to some extent, with certain limitations). – nhahtdh Jan 28 '13 at 11:39
1

I don't directly answer your question, but this post should contain enough information for you to write a working regex for your problem.

You can append a list of negative look-behinds. Remember that look-behinds are zero-width, which means that you can put as many look-behinds as you want next to each other, and they all look behind from the same position. As long as you don't need a variable-width quantifier (e.g. *, +, {n,}) inside a look-behind, everything should be fine (?).

So the regex can be constructed like this:

(?<!list )(?<!of )(?<!words )(?<!not )(?<!allowed )(?<!to )(?<!precede )pattern\w+

It is a bit too verbose. Anyway, I wrote this post just to demonstrate that it is possible to look behind on a list of fixed strings.

Example run:

>>> import re
>>> s = 'something patterning of patterned crap patternon not patterner, not allowed patternes to patternsses, patternet'
>>> re.findall(r'(?<!list )(?<!of )(?<!words )(?<!not )(?<!allowed )(?<!to )(?<!precede )pattern\w+', s)
['patterning', 'patternon', 'patternet']
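Applied to the sentence-splitting question, the chain of look-behinds can be generated from the abbreviation list. This is a rough sketch, not a tested solution for real text: `build_split_pattern` is a made-up name, `abbreviations` stands in for the lines of the asker's file, and each entry is assumed to include its trailing period.

```python
import re

def build_split_pattern(abbreviations):
    """Split on whitespace after a period unless an abbreviation ends there."""
    # One negative look-behind per abbreviation; each has its own fixed width.
    guards = "".join(
        "(?<!%s)" % re.escape(a.strip()) for a in abbreviations if a.strip()
    )
    # (?<=\.) requires a period just before the whitespace that gets consumed.
    return re.compile(r"(?<=\.)%s\s+" % guards)

# Stand-in for the abbreviation file, one entry per line.
abbreviations = ["Mr.", "Ms.", "Dr.", "Mrs.", "St."]
splitter = build_split_pattern(abbreviations)

text = "Dr. Smith met Mrs. Jones. They talked on Baker St. for hours. Bye."
sentences = splitter.split(text)
print(sentences)
```

Here the look-behinds sit directly after the period and `\s+` consumes whatever whitespace follows. Note that an abbreviation is also protected when it is the tail of a longer word, since no word boundary is checked.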

There is a catch in using look-behind, though. If there is a variable number of spaces between the blacklisted text and the text matching the pattern, the regex above will fail. I really doubt there exists a way to modify the regex so that it works for that case while keeping the look-behinds. (You can always collapse consecutive spaces into one, but that won't work for more general cases.)
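The catch, and the space-collapsing workaround, can be seen in a few lines. The sample string is invented for this illustration, and only one of the look-behinds is kept for brevity.

```python
import re

pattern = re.compile(r"(?<!not )pattern\w+")

# Two spaces after "not": the four characters before "pattern" are "ot  ",
# so the look-behind for "not " does not fire and the word slips through.
text = "not  patterner"
leaked = pattern.findall(text)

# Collapsing runs of spaces first restores the intended behaviour here.
normalized = re.sub(r" +", " ", text)
blocked = pattern.findall(normalized)

print(leaked)
print(blocked)
```

Normalizing changes the input text itself, so it is only an option when the exact original spacing does not need to be preserved.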

nhahtdh
  • I have formatted my long list of literal text into a one long string like this: (?<!Ala\. )(?<!Ariz\. )(?<!Assn\. )(?<!Atty\. )(?<!Aug\. )(?<!Ave\. )(?<!Bldg\. )(?<!Blvd\. )(?<!Calif\. )(?<!Capt\. )(?<!Cf\. )(?<!Ch\. )(?<!Co\. )(?<!Col\. )(?<!Colo\. )(?<!Conn\. )(?<!Corp\. )(?<!DR\. )(?<!Dec\. )(?<!Dept\. )(?<!Dist\. )(?<!Dr\. )(?<!Drs\. )(?<!Ed\. )(?<!Eq\. )(?<!FEB\. )(?<!Feb\. )(?<!Fig\. )(?<!Figs\. )(?<!Fla\. )(?<!Ga\. )(?<!Gen\. )(?<!Gov\. )(?<!HON\. )(?<!Ill\. )(?<!Inc\. )(?<!JR\. )(?<!Jan\. )(?<!Jr\. )(?<!Kan\. )(?<!Ky\. )(?<!La\. )(?<!Lt\. )(?<!Ltd\. )(?<!MR\. ) Would it work? – user2017502 Jan 29 '13 at 02:18
  • @user2017502: Look at the catch when using this. It is going to fail for the case `Mr.__X` (`_` is space - 2 spaces between `.` and name) – nhahtdh Jan 29 '13 at 03:44