Python Regex Sentence Finder-Want to Ignore "a.m."

Question

I am developing a regex to find sentences, and I would like to ignore abbreviations that cause the regex to terminate before the end of the sentence. For example, I want to ignore "a.m." so that it returns "At 9:00 a.m. the store opens." instead of "At 9:00 a.m."

import re

def sentence_finder(x):
    # Compile the sentence pattern and return every match found in x.
    RegexObject = re.compile(r'[A-Z].+?\b(?!a\.m\.\b)\w+[.?!](?!\S)')
    Variable = RegexObject.findall(x)
    return Variable

I get back the following when I run pytest:

def test_pass_Ignore_am():
>       assert DuplicateSentences.sentence_finder("At 9:00 a.m. the store opens.") == ["At 9:00 a.m. the store opens."]
E       AssertionError: assert ['At 9:00 a.m.'] == ['At 9:00 a.m...store opens.']
E         At index 0 diff: 'At 9:00 a.m.' != 'At 9:00 a.m. the store opens.'

What am I doing wrong?

Try `[A-Z](?:a\.m\.|.)*?\w[.?!](?!\S)`, see https://regex101.com/r/78k0By/1 — Wiktor Stribiżew, Jan 27 '21 at 01:39
What about `p.m.`, `e.g.`, `Ave.`, etc.? You should use a more general rule rather than only excluding `a.m.` — Hao Wu, Jan 27 '21 at 02:01
For previous work on a sentence finder you should check out my answer to [Using regular expression as a tokenizer](https://stackoverflow.com/questions/63870746/using-regular-expression-as-a-tokenizer/63871635#63871635). It worked perfectly on this [complex legal narrative](https://pastebin.com/y0XC3wig). — DarrylG, Jan 27 '21 at 02:06
Try [`(?=[A-Z]).+?[.?!](?=\s*(?:[A-Z]|$))`](https://regex101.com/r/HR7Ij3/1), it should work most of the time. — Hao Wu, Jan 27 '21 at 02:11
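
The two patterns suggested in the comments above can be compared side by side. This is only a quick sketch: the variable names are chosen here for illustration, and the second test string is made up to show where the two approaches differ.

```python
import re

# Pattern from the first comment: try to consume "a.m." as a unit
# before falling back to any single character.
alt_pattern = re.compile(r'[A-Z](?:a\.m\.|.)*?\w[.?!](?!\S)')

# Pattern from the last comment: a sentence ends only where a capital
# letter or the end of the string follows the terminator.
cap_pattern = re.compile(r'(?=[A-Z]).+?[.?!](?=\s*(?:[A-Z]|$))')

text = "At 9:00 a.m. the store opens."
other = "It closes at 5:00 p.m. on Friday."  # made-up example

print(alt_pattern.findall(text))   # ['At 9:00 a.m. the store opens.']
print(cap_pattern.findall(text))   # ['At 9:00 a.m. the store opens.']
print(alt_pattern.findall(other))  # ['It closes at 5:00 p.m.', 'Friday.']
print(cap_pattern.findall(other))  # ['It closes at 5:00 p.m. on Friday.']
```

The first pattern only knows about `a.m.`, so `p.m.` still ends a sentence. The second one handles any abbreviation followed by a lowercase word, but it would presumably split after an abbreviation that is followed by a capitalized word.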

The fourth bird · Answer 1 · 2021-01-27T08:12:31.540

You could use a negative lookbehind to check, right after matching the closing punctuation, that the text directly before the current position is not "a.m.".

[A-Z].*?\w[.?!](?<!\ba\.m\.)(?!\S)

Explanation

[A-Z] Match a char A-Z
.*? Match any char except a newline 0+ times, as few as possible
\w[.?!] Match a word char followed by either . ? or !
(?<!\ba\.m\.) Negative lookbehind to assert that directly to the left is not a.m.
(?!\S) Assert a whitespace boundary to the right
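
As a minimal sketch, the pattern above could be dropped into the function from the question and extended to a few more abbreviations. The `ABBREVIATIONS` tuple and the `build_sentence_regex` helper are hypothetical names introduced here, not part of the original answer. Each abbreviation gets its own negative lookbehind because Python's `re` module only accepts fixed-width lookbehinds.

```python
import re

# Hypothetical list of abbreviations to skip; extend as needed.
ABBREVIATIONS = ("a.m.", "p.m.", "e.g.")


def build_sentence_regex(abbreviations=ABBREVIATIONS):
    # One fixed-width negative lookbehind per abbreviation, placed right
    # after the sentence terminator, as in the pattern above.
    lookbehinds = "".join(rf"(?<!\b{re.escape(a)})" for a in abbreviations)
    return re.compile(rf"[A-Z].*?\w[.?!]{lookbehinds}(?!\S)")


def sentence_finder(text, regex=build_sentence_regex()):
    # Return every sentence-like match found in text.
    return regex.findall(text)


print(sentence_finder("At 9:00 a.m. the store opens."))
# ['At 9:00 a.m. the store opens.']
print(sentence_finder("It closes at 5:00 p.m. on weekdays. Is it busy? Yes!"))
# ['It closes at 5:00 p.m. on weekdays.', 'Is it busy?', 'Yes!']
```

One trade-off of this approach: a sentence that really does end in a listed abbreviation is not split there, so "The store opens at 9:00 a.m. It is busy." comes back as a single match.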

