I am developing a regex to find sentences, and I would like to ignore abbreviations that cause the regex to terminate before the end of the sentence. For example, I want to ignore "a.m." so that it returns "At 9:00 a.m. the store opens." instead of "At 9:00 a.m."

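import re

def sentence_finder(x):
    # Attempted pattern: the lookahead is meant to skip "a.m.", but the match still stops there.
    RegexObject = re.compile(r'[A-Z].+?\b(?!a\.m\.\b)\w+[.?!](?!\S)')
    Variable = RegexObject.findall(x)
    return Variable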

I get back the following when I run pytest:

def test_pass_Ignore_am():
>       assert DuplicateSentences.sentence_finder("At 9:00 a.m. the store opens.") == ["At 9:00 a.m. the store opens."]
E       AssertionError: assert ['At 9:00 a.m.'] == ['At 9:00 a.m...store opens.']
E         At index 0 diff: 'At 9:00 a.m.' != 'At 9:00 a.m. the store opens.'

What am I doing wrong?

  • Try `[A-Z](?:a\.m\.|.)*?\w[.?!](?!\S)`, see https://regex101.com/r/78k0By/1 – Wiktor Stribiżew Jan 27 '21 at 01:39
  • What about `p.m.`, `e.g.`, `Ave.`, etc.? You should use a more universal rule rather than only excluding `a.m.` – Hao Wu Jan 27 '21 at 02:01
  • For previous work on a sentence finder you should check out my answer to [Using regular expression as a tokenizer](https://stackoverflow.com/questions/63870746/using-regular-expression-as-a-tokenizer/63871635#63871635). It worked perfectly on this [complex legal narrative](https://pastebin.com/y0XC3wig). – DarrylG Jan 27 '21 at 02:06
  • Try [`(?=[A-Z]).+?[.?!](?=\s*(?:[A-Z]|$))`](https://regex101.com/r/HR7Ij3/1), it should work most of the time. – Hao Wu Jan 27 '21 at 02:11

1 Answer

You could use a negative lookbehind to assert that, right after the closing dot is matched, the text directly to its left is not a.m.

[A-Z].*?\w[.?!](?<!\ba\.m\.)(?!\S)

Explanation

  • [A-Z] Match a char A-Z
  • .*? Match any char except a newline 0+ times, as few as possible
  • \w[.?!] Match a word char followed by either . ? or !
  • (?<!\ba\.m\.) Negative lookbehind to assert that directly to the left is not a.m.
  • (?!\S) Assert a whitespace boundary to the right
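
As a minimal sketch, the pattern above could be dropped into the function from the question like this. The `build_sentence_regex` helper and its abbreviation list are hypothetical and not part of the original answer; they only illustrate one way to cover the other abbreviations raised in the comments. Python's `re` module requires fixed-width lookbehinds, so the sketch chains one lookbehind per abbreviation instead of using a single alternation.

```python
import re

# The pattern from this answer, compiled once.
SENTENCE_RE = re.compile(r'[A-Z].*?\w[.?!](?<!\ba\.m\.)(?!\S)')


def sentence_finder(text):
    """Return the sentences in text without stopping at 'a.m.'."""
    return SENTENCE_RE.findall(text)


def build_sentence_regex(abbreviations):
    """Hypothetical helper: add one negative lookbehind per abbreviation.

    Each abbreviation gets its own lookbehind because a single
    lookbehind with alternatives of different lengths is rejected
    by Python's re module.
    """
    lookbehinds = "".join(
        rf"(?<!\b{re.escape(abbr)})" for abbr in abbreviations
    )
    return re.compile(rf"[A-Z].*?\w[.?!]{lookbehinds}(?!\S)")


print(sentence_finder("At 9:00 a.m. the store opens."))
# ['At 9:00 a.m. the store opens.']

# Assumed abbreviation list, chosen only for illustration.
multi_re = build_sentence_regex(["a.m.", "p.m.", "Ave."])
print(multi_re.findall("Meet at 5 p.m. on Main Ave. near the park."))
# ['Meet at 5 p.m. on Main Ave. near the park.']
```

Note that this approach never ends a sentence on a listed abbreviation, so a sentence that really does end with "a.m." would be merged with the one that follows it.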

Regex demo

The fourth bird