I am using the great answer provided by D Greenberg in the Stack Overflow Q&A "Python split text on sentences" to split text into sentences. I would like help augmenting one part of it.
The overall code uses a bunch of regular expressions to recognize abbreviations, acronyms, websites, prefixes (Mr., Mrs., etc.) and other non-sentence endings, and changes each u'.' in them into u'<prd>'. All the u'.' that aren't changed must be periods that end sentences.
The re that recognizes websites only works for URLs of the form text.(com|org|gov...). It doesn't work for text1.text2.text3.(com|org|gov...). May I have some help in making this work?
I have edited the original code to just the relevant section:
import re

def split_into_sentences(text):
    prefixes = u"(Mr|St|Mrs|Ms|Dr)[.]"
    suffixes = u"(Inc|Ltd|Jr|Sr|Co)"
    websites = u"[.](com|net|org|io|gov)"
    digits = u"([0-9])"
    text = text.replace(u"\n",u" ")
    # Protect periods that do not end a sentence by turning them into <prd>
    text = re.sub(prefixes,u"\\1<prd>",text)
    text = re.sub(websites,u"<prd>\\1",text)
    text = re.sub(digits + u"[.]" + digits,u"\\1<prd>\\2",text)
    if u"Ph.D" in text: text = text.replace(u"Ph.D.",u"Ph<prd>D<prd>")
    # Every remaining ., ? or ! is treated as the end of a sentence
    text = text.replace(u".",u".<stop>")
    text = text.replace(u"?",u"?<stop>")
    text = text.replace(u"!",u"!<stop>")
    # Restore the protected periods, then split on the markers
    text = text.replace(u"<prd>",u".")
    sentences = text.split(u"<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    return sentences
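To make the limitation concrete, here is a small sketch that runs only the website substitution from the code above on a made-up sample string (the `sample` text and the example.com address are assumptions for illustration, not from the original answer):

```python
import re

# Website pattern and substitution exactly as in the function above
websites = u"[.](com|net|org|io|gov)"

# Hypothetical input containing a multi-part host name
sample = u"Visit www.example.com today."

protected = re.sub(websites, u"<prd>\\1", sample)
print(protected)  # Visit www.example<prd>com today.
```

Only the dot before com is protected. The dot after www is left in place, so the later replace of u"." with u".<stop>" would split the sentence there.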
I believe the following re will find a full URL or email address (I know there are more domains possible and I will augment if needed):
websites = ur"([\w@-]+[.])+(com|net|org|io|gov)"
What I can't figure out how to do is change the text = re.sub(websites,u"<prd>\\1",text) to accomplish what I want: in the portions of text that match the website pattern, change all of the u'.' into u'<prd>'.
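For illustration, one possible way to get that behaviour is to pass a function instead of a string as the replacement argument of re.sub, so that every dot inside each match can be rewritten. This is only a minimal sketch under some assumptions: the helper name `protect_dots`, the name `website_re`, and the sample text are hypothetical, and the pattern is written as a plain raw string (`r"..."`) because the `ur` prefix is a syntax error in Python 3.

```python
import re

# Full URL / email pattern from the question, as a plain raw string
website_re = re.compile(r"([\w@-]+[.])+(com|net|org|io|gov)")

def protect_dots(match):
    # match.group(0) is the whole matched URL or email address;
    # replace every dot inside it with the <prd> placeholder
    return match.group(0).replace(".", "<prd>")

# Hypothetical sample text
sample = "Mail me at joe@mail.example.com or see www.example.org."

# re.sub calls protect_dots once per match and inserts its return value
protected = website_re.sub(protect_dots, sample)
print(protected)
# Mail me at joe@mail<prd>example<prd>com or see www<prd>example<prd>org.
```

In the function from the question, a call like this would stand in for the line text = re.sub(websites,u"<prd>\\1",text). The final period of the sample is not part of the match, so it is still available to be marked as a sentence end. Note that the pattern has no word boundary after the domain, so it would also match the start of something like example.community; appending \b may be worth considering.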