split text by periods except in certain cases

Question

I am trying to split a string containing an entire text document into sentences so that I can convert it to a CSV. Naturally, I would use periods as the delimiter and call str.split('.'). However, the document contains the abbreviations 'i.e.' and 'e.g.', whose periods I want to ignore.

For example,

Original Sentence: During this time, it became apparent that vanilla shortest-path routing would be insufficient to handle the myriad operational, economic, and political factors involved in routing. ISPs began to modify routing configurations to support routing policies, i.e. goals held by the router’s owner that controlled which routes were chosen and which routes were propagated to neighbors.

Resulting List: ["During this time, it became apparent that vanilla shortest-path routing would be insufficient to handle the myriad operational, economic, and political factors involved in routing", "ISPs began to modify routing configurations to support routing policies, i.e. goals held by the router’s owner that controlled which routes were chosen and which routes were propagated to neighbors."]

My only workaround so far is replacing all 'i.e.' and 'e.g.' with 'ie' and 'eg', which is both inefficient and grammatically undesirable. I am experimenting with Python's regex library, which I suspect holds the answer, but I am a novice with it at best.

It is my first time posting a question on here so I apologize if I am using incorrect format or wording.

Perhaps see the 118 upvotes answer on [This function can split the entire text of Huckleberry Finn into sentences in about 0.1 seconds and handles many of the more painful edge cases that make sentence parsing non-trivial e.g. "Mr. John Johnson Jr. was born](https://stackoverflow.com/questions/4576077/how-can-i-split-a-text-into-sentences) — MDR, Jul 12 '21 at 01:23
The function may work out for you. Example: https://ibb.co/FB4GX2m — MDR, Jul 12 '21 at 01:28
It's a cool toy example. There's an entire field of study dedicated to this question, and it's not regex. — smcjones, Jul 12 '21 at 01:53

Greg W.F.R · Answer 1 · 2021-07-12T02:02:11.420

This one should work!

import re

p = "During this time, it became apparent that vanilla shortest-path routing would be insufficient to handle the myriad operational, economic, and political factors involved in routing. ISPs began to modify routing configurations to support routing policies, i.e. goals held by the router’s owner that controlled which routes were chosen and which routes were propagated to neighbors."

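sentences = []
while p:
    sentence = ""
    while True:
        # A chunk is a run of capitals plus everything up to the next capital.
        match = re.search(r"[A-Z]+[^A-Z]+", p)
        if match is None:
            p = ""  # nothing left to match; clear p so the outer loop ends
            break
        p = p[match.end():]
        sentence += match.group(0)
        # A chunk ending in ". " closes the current sentence.
        if match.group(0).endswith(". "):
            break
    if sentence:
        sentences.append(sentence)

print(sentences)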

Using `string.split(". ")` still does not give the desired result, as it splits the sentence after the second period in 'i.e. goals held by...' — jx-zh, Jul 12 '21 at 00:46

smcjones · Accepted Answer · 2021-07-12T01:49:22.533

See "How can I split a text into sentences?", which suggests the Natural Language Toolkit (NLTK).

Here is an example that explains why it is done this way:

I go by the name of I. Brown. I bet I could make a sentence difficult to parse. No one is more suited to this task than I.

How do you break this into different sentences?

You need semantics (a formal sentence is usually made up of a subject, an object, and a verb), which a regular expression won't capture. RegEx does syntax very well, but not semantics (meaning).

To illustrate: the answer someone else suggested, which has 115 votes, relies on a lot of complex regex, and is fairly slow, would break on my simple example.

It's an NLP problem, so I linked to an answer that gave an NLP package.
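
As a rough illustration of the point above (a sketch that is not part of the original answer), the snippet below applies a purely punctuation-based rule to the example text. The helper name `naive_split` and its pattern are assumptions made for this demonstration only.

```python
import re

text = ("I go by the name of I. Brown. I bet I could make a sentence "
        "difficult to parse. No one is more suited to this task than I.")

def naive_split(s):
    # Hypothetical helper: split on whitespace that follows a period
    # and precedes a capital letter.
    return re.split(r"(?<=\.)\s+(?=[A-Z])", s)

parts = naive_split(text)
for part in parts:
    print(repr(part))
```

The split yields four pieces instead of three, because the initial in "I. Brown" looks exactly like a sentence ending. A rule that treats a lone capital letter plus a period as an initial would fix that case, but it would then miss a genuine sentence ending in "I." whenever more text follows it.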

score 1 · Answer 3 · edited Jul 12 '21 at 01:59

This is a crude implementation.

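inp = input()
res = []
last = 0  # index where the current sentence starts
for x in range(2, len(inp) - 2):
    # A period ends a sentence unless another period sits two characters
    # before or after it, as in "i.e." or "e.g."
    if inp[x] == "." and inp[x - 2] != "." and inp[x + 2] != ".":
        res.append(inp[last:x])
        last = x + 2  # skip the period and the space that follows it
# Append the final sentence, dropping its closing period if there is one.
tail = inp[last:]
res.append(tail[:-1] if tail.endswith(".") else tail)
print(res)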

If I use your input, I get this output (hopefully, this is what you are looking for):

['During this time, it became apparent that vanilla shortest-path routing would be insufficient to handle the myriad operational, economic, and political factors involved in routing', 'ISPs began to modify routing configurations to support routing policies, i.e. goals held by the router’s owner that controlled which routes were chosen and which routes were propagated to neighbors']

Note: You might have to adjust this code if the text you are using does not follow the usual spacing conventions (for example, no space between words or after the period that ends a sentence).
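
For comparison, the same idea can be sketched with Python's `re` module using negative lookbehinds. This is only a minimal sketch under stated assumptions: the abbreviation list is limited to lowercase "i.e." and "e.g.", and the names `SENTENCE_END`, `split_sentences`, and `sample` are hypothetical. It does not address harder cases such as names or titles.

```python
import re

# Assumed abbreviation list; add one lookbehind per extra abbreviation,
# since each lookbehind in Python's re module must have a fixed width.
SENTENCE_END = re.compile(r"(?<!i\.e)(?<!e\.g)\.(?:\s+|$)")

def split_sentences(text):
    # Hypothetical helper: split on periods followed by whitespace or the
    # end of the text, unless the period closes "i.e." or "e.g."
    return [s for s in SENTENCE_END.split(text) if s]

sample = ("Routing alone was insufficient. ISPs began to support routing "
          "policies, i.e. goals held by the router's owner, e.g. which routes to prefer.")
result = split_sentences(sample)
print(result)
```

The periods inside the abbreviations are skipped because they are followed by a letter rather than whitespace, and the closing period of each abbreviation is rejected by the lookbehinds.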
