3

I am trying to match spaces just after punctuation marks so that I can split up a large corpus of text, but I am seeing some common edge cases with places, titles and common abbreviations:

I am from New York, N.Y. and I would like to say hello! How are you today? I am well. I owe you $6. 00 because you bought me a No. 3 burger. -Sgt. Smith

I am using this with the re.split function in Python 3. I want to get this:

["I am from New York, N.Y. and I would like to say hello!",
"How are you today?",
"I am well.",
"I owe you $6. 00 because you bought me a No. 3 burger.",
"-Sgt. Smith"]

This is currently my regex:

(?<=[\.\?\!])(?<=[^A-Z].)(?<=[^0-9].)(?<=[^N]..)(?<=[^o].)

I decided to try to fix the No. case first, with the last two conditions. But it relies on matching the N and the o independently, which I think is going to cause false positives elsewhere. I cannot figure out how to make it match just the string No behind the period. I will then use a similar approach for Sgt. and any other "problem" strings I come across.

I am trying to use something like:

(?<=[\.\?\!])(?<=[^A-Z].)(?<=[^0-9].)^(?<=^No$)

But it doesn't capture anything after that. How can I get it to exclude certain strings which I expect to have a period in them, and not capture them?

Here is a regexr of my situation: https://regexr.com/4sgcb

wfgeo
  • Have you tried nesting some lookarounds? For instance `(?<=(?<!No)[\.\?\!]) ` (there's a trailing space) matches 1 space preceded by a punctuation sign as long as the latter is not preceded by `No`. However, I don't know Python's regexes, so I don't know how to overcome the limitation that _a lookbehind assertion has to be fixed width_, which makes this `(?<=(?<!(No|Sgt))[\.\?\!]) ` error, for instance. – Enlico Jan 18 '20 at 21:36
  • This might have a lot of edge cases, but for your example data you might try `(?<=[.?!](?! [a-z0-9]))(?<!-Sgt\.) ` See https://regex101.com/r/RMn6dY/1 Or, if you can make use of the regex PyPI module, you might use `(?<=[.?!](?! [a-z0-9]))(?<![^\w\s]\S*\.) ` https://regex101.com/r/AC9Hrv/1 – The fourth bird Jan 18 '20 at 21:39
  • @Thefourthbird, in theory `()` and `[]` together should be enough of a tool to deal with edge cases too. But the resulting regex could indeed be long and unreadable (just like when you try to match a number in a numeric interval). – Enlico Jan 18 '20 at 22:11
  • 1
    Can you narrow the potential edge cases down to a fixed set? Because if not, I doubt that you will be able to do this without false positives or false negatives. Let's take "No." for example. Yes, in the context you have given, it is a short form of "Number" and is not the end of a sentence. But what about "No. I am your father" or "My answer is No."? It's impossible to distinguish these without context. For "No" it might be enough to just check for a subsequent number, but for others it might not. For example: "The title sergeant is often abbreviated with the letters Sgt." – Benjamin Basmaci Jan 18 '20 at 22:13
  • Why do you want to use a complicated regex for this when you can solve it with 2 simple regexes (`[\.\?!]`, `(?:[A-Z]\.){2,}|No\.`), a loop and a condition... – inf3rno Jan 18 '20 at 22:36

4 Answers

2

This is the closest regex I could get (the trailing space is the one we match):

(?<=(?<!(No|\.\w))[\.\?\!])(?! *\d+ *) 

which will also split after Sgt., for the simple reason that a lookbehind assertion has to be fixed-width in Python's re module (what a limitation!).

This is how I would do it in vim, which has no such limitation (the trailing space is the one we match):

\(\(No\|Sgt\|\.\w\)\@<![?.!]\)\( *\d\+ *\)\@!\zs 
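Back in Python, one common way to live with the fixed-width restriction is to chain several lookbehinds, each with its own fixed width, instead of putting alternatives of different lengths into a single one. The following is only a sketch under that assumption: the name `SPLIT_RE` and the pattern itself are illustrative and are not the regex shown above. It also avoids capturing groups, because re.split returns the contents of capturing groups as part of its result list.

```python
import re

# Hypothetical pattern for illustration; NOT the regex from this answer.
# Every lookbehind has its own fixed width, so "No." (3 characters) and
# "Sgt." (4 characters) can both be excluded without an alternation.
SPLIT_RE = re.compile(
    r"(?<=[.?!])"    # the space must follow sentence punctuation
    r"(?<!No\.)"     # ...but not "No."
    r"(?<!Sgt\.)"    # ...nor "Sgt."
    r"(?<!\.\w\.)"   # ...nor an initialism such as "N.Y."
    r" (?!\d)"       # the space itself, not followed by a digit
)

s = ("I am from New York, N.Y. and I would like to say hello! "
     "How are you today? I am well. I owe you $6. 00 because you "
     "bought me a No. 3 burger. -Sgt. Smith")

sentences = SPLIT_RE.split(s)
for sentence in sentences:
    print(sentence)
```

Each further abbreviation needs its own `(?<!...)` term, so this only stays readable for a short, known list.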

For the OP as well as the casual reader, this question and the answers to it are about lookarounds and are very interesting.

Enlico
2

You may consider a matching approach; it gives you better control over the entities you want to treat as single words rather than as sentence-break signals.

Use a pattern like

\s*((?:\d+\.\s*\d+|(?:No|M[rs]|[JD]r|S(?:r|gt))\.|\.(?!\s+-?[A-Z0-9])|[^.!?])+(?:[.?!]|$))

See the regex demo

It is very similar to what I posted here, but it contains a pattern to match poorly formatted float numbers, added No. and Sgt. abbreviation support and a better handling of strings not ending with final sentence punctuation.

Python demo:

import re
p = re.compile(r'\s*((?:\d+\.\s*\d+|(?:No|M[rs]|[JD]r|S(?:r|gt))\.|\.(?!\s+-?[A-Z0-9])|[^.!?])+(?:[.?!]|$))')
s = "I am from New York, N.Y. and I would like to say hello! How are you today? I am well. I owe you $6. 00 because you bought me a No. 3 burger. -Sgt. Smith"
for m in p.findall(s):
    print(m)

Output:

I am from New York, N.Y. and I would like to say hello!
How are you today?
I am well.
I owe you $6. 00 because you bought me a No. 3 burger.
-Sgt. Smith

Pattern details

  • \s* - 0 or more whitespace characters (used to trim the results)
  • (?:\d+\.\s*\d+|(?:No|M[rs]|[JD]r|S(?:r|gt))\.|\.(?!\s+-?[A-Z0-9])|[^.!?])+ - one or more occurrences of several alternatives:
    • \d+\.\s*\d+ - 1+ digits, ., 0+ whitespace characters, 1+ digits
    • (?:No|M[rs]|[JD]r|S(?:r|gt))\. - abbreviated strings like No., Mr., Ms., Jr., Dr., Sr., Sgt.
    • \.(?!\s+-?[A-Z0-9]) - a dot that is not followed by 1+ whitespace characters, an optional -, and then an uppercase letter or a digit
    • | - or
    • [^.!?] - any character other than ., ! or ?
  • (?:[.?!]|$) - a ., ! or ?, or the end of the string.
Wiktor Stribiżew
2

As mentioned in my comment above, if you are not able to define a fixed set of edge cases, this might not be possible without false positives or false negatives. Again, without context you cannot distinguish between abbreviations like "-Sgt. Smith" and ends of sentences like "Sergeant is often abbreviated as Sgt. This makes it shorter.".

However, if you can define a fixed set of edge cases, it's probably easier and much more readable to do this in multiple steps.


1. Identify your edge cases

For example, you can distinguish "I'll have a No. 3" from "No. I am your father" by checking for a subsequent number. So you would identify that edge case with a regex like this: No\. \d (Again, context matters. Sentences like "Is 200 enough? No. 200 is not enough." will still give you a false positive.)

2. Mask your edge cases

For each edge case, replace the matched string with a placeholder that is guaranteed not to occur in the original text. E.g. "No." => "======NUMBER======"

3. Run your algorithm

Now that you got rid of your unwanted punctuations, you can run a simpler regex like this to identify the true positives: [\.\!\?]\s

4. Unmask your edge cases

Turn "======NUMBER======" back into "No."
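The four steps above can be sketched in Python roughly as follows. This is a minimal illustration that handles only the "No." followed by a number case; the function name `split_sentences`, the constant `MASK` and the sample text are assumptions made for the example.

```python
import re

# Placeholder from step 2; assumed never to occur in the real text.
MASK = "======NUMBER======"


def split_sentences(text):
    # Steps 1 and 2: find "No." followed by a digit and mask it,
    # which removes its period from the text for now.
    masked = re.sub(r"\bNo\.(?=\s*\d)", MASK, text)
    # Step 3: split on whitespace that follows sentence punctuation.
    parts = re.split(r"(?<=[.!?])\s+", masked)
    # Step 4: turn the placeholder back into "No."
    return [part.replace(MASK, "No.") for part in parts]


text = "I am well. You bought me a No. 3 burger. No. I am your father."
result = split_sentences(text)
print(result)
```

Here the "No." before "3" survives the split, while the standalone "No." is still treated as a sentence of its own. More edge cases would each need their own identifying regex and placeholder.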

Benjamin Basmaci
1

Doing it with only one regex will be tricky - as stated in comments, there are lots of edge cases.

I would do it in three steps:

  1. Replace spaces that should stay with some special character (re.sub)
  2. Split the text (re.split)
  3. Replace the special character with space

For example:

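import re
from pprint import pprint

# Placeholder for spaces that must survive the split; \s does not match it
zero_width_space = '\u200B'

s = 'I am from New York, N.Y. and I would like to say hello! How are you today? I am well. I owe you $6. 00 because you bought me a No. 3 burger. -Sgt. Smith'

# 1. Protect whitespace after a period followed by a digit or lowercase letter, after a comma, and after "Sgt."
s = re.sub(r'(?<=\.)\s+(?=[\da-z])|(?<=,)\s+|(?<=Sgt\.)\s+', zero_width_space, s)
# 2. Split on the remaining whitespace that follows sentence punctuation
s = re.split(r'(?<=[.?!])\s+', s)

# 3. Restore the protected spaces
pprint([line.replace(zero_width_space, ' ') for line in s])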

Prints:

['I am from New York, N.Y. and I would like to say hello!',
 'How are you today?',
 'I am well.',
 'I owe you $6. 00 because you bought me a No. 3 burger.',
 '-Sgt. Smith']
Andrej Kesely
  • 1
    Thanks everyone for the really detailed answers, I went with this one because of its pure simplicity. I am aware that it would be practically impossible to cover all possible edge cases - I am just looking for a way to rectify a majority of them. I did however tweak the first expression to `(?<=[\dA-Z]\.)\s+|(?<=No\.)\s+` because I can be reasonably certain that a "normal" sentence will not end with a capital. I parameterize this in Python to use a list of "exception" strings in place of `No`, like `Jan.`, `Feb.`, `Col.`, etc. Probably not the most efficient, but it works for my small data set – wfgeo Jan 18 '20 at 23:22
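The parameterized variant described in the comment above is not shown, so the following is only a guess at what it could look like. The names `EXCEPTIONS`, `build_protect_pattern` and `split_sentences` are hypothetical. One lookbehind is generated per exception string, so each stays fixed-width even though the strings differ in length.

```python
import re

ZWSP = "\u200b"  # zero-width space, used as a temporary placeholder

# Hypothetical exception list; extend it with whatever abbreviations show up.
EXCEPTIONS = ["No.", "Sgt.", "Jan.", "Feb.", "Col."]


def build_protect_pattern(exceptions):
    # Whitespace after a digit or capital letter plus a period is protected...
    alternatives = [r"(?<=[\dA-Z]\.)\s+"]
    # ...and so is whitespace after each exception string. Each string gets
    # its own lookbehind, so every lookbehind has a fixed width.
    alternatives += [rf"(?<={re.escape(e)})\s+" for e in exceptions]
    return re.compile("|".join(alternatives))


def split_sentences(text, exceptions=EXCEPTIONS):
    protected = build_protect_pattern(exceptions).sub(ZWSP, text)
    parts = re.split(r"(?<=[.?!])\s+", protected)
    return [part.replace(ZWSP, " ") for part in parts]


s = ("I am from New York, N.Y. and I would like to say hello! "
     "How are you today? I am well. I owe you $6. 00 because you "
     "bought me a No. 3 burger. -Sgt. Smith")
result = split_sentences(s)
for sentence in result:
    print(sentence)
```

For a large corpus it would probably be worth compiling the pattern once rather than on every call.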