
I am using the great answer provided by D Greenberg in the Stack Overflow Q&A "Python split text on sentences" to split text into sentences. I would like help augmenting one part of it.

The overall code uses a series of regular expressions to recognize abbreviations, acronyms, websites, prefixes (Mr., Mrs., etc.) and other non-sentence endings, and replaces each u'.' in them with u'<prd>'. Any u'.' left unchanged must be a period that ends a sentence.

The regex that recognizes websites only works for URLs of the form text.(com|org|gov...). It doesn't work for text1.text2.text3.(com|org|gov...). May I have some help in making this work?

I have edited the original code to just the relevant section:

import re

def split_into_sentences(text):
    prefixes = u"(Mr|St|Mrs|Ms|Dr)[.]"
    suffixes = u"(Inc|Ltd|Jr|Sr|Co)"
    websites = u"[.](com|net|org|io|gov)"
    digits = u"([0-9])"

    text = text.replace(u"\n",u" ")
    text = re.sub(prefixes,u"\\1<prd>",text)
    text = re.sub(websites,u"<prd>\\1",text)
    text = re.sub(digits + u"[.]" + digits,u"\\1<prd>\\2",text)
    if u"Ph.D" in text: text = text.replace(u"Ph.D.",u"Ph<prd>D<prd>")

    text = text.replace(u".",u".<stop>")
    text = text.replace(u"?",u"?<stop>")
    text = text.replace(u"!",u"!<stop>")

    text = text.replace(u"<prd>",u".")
    sentences = text.split(u"<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    return sentences

I believe the following regex will find a full URL or email address (I know there are more domains possible and I will add them if needed):

websites = ur"([\w@-]+[.])+(com|net|org|io|gov)"

What I can't figure out is how to change the line text = re.sub(websites,u"<prd>\\1",text) to accomplish what I want: in the portions of text that match the website pattern, change every u'.' into u'<prd>'.

racketteer

1 Answer


You can use your pattern to match each URL or email address, then replace the periods inside each match by passing a callable (here, a lambda) as the second argument to re.sub:

result = re.sub(websites, lambda x: x.group().replace(u".", u"<prd>"), text)
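
To see how this could slot into the function from the question, here is a minimal sketch, not a definitive implementation. It assumes Python 3, so the pattern is written as a plain raw string (the ur"" prefix is only valid in Python 2) and the u prefixes are dropped. The names WEBSITES and protect_periods are made up for illustration, and the function is trimmed to the steps shown in the question.

```python
import re

# Pattern from the question, written as a plain raw string for Python 3.
WEBSITES = r"([\w@-]+[.])+(com|net|org|io|gov)"


def protect_periods(match):
    # Hypothetical helper: hide every period inside one matched URL/email.
    return match.group().replace(".", "<prd>")


def split_into_sentences(text):
    text = text.replace("\n", " ")
    text = re.sub(r"(Mr|St|Mrs|Ms|Dr)[.]", r"\1<prd>", text)
    # A callable as the second argument is invoked once per match.
    text = re.sub(WEBSITES, protect_periods, text)
    text = re.sub(r"([0-9])[.]([0-9])", r"\1<prd>\2", text)
    text = text.replace("Ph.D.", "Ph<prd>D<prd>")

    # Any period, question mark, or exclamation mark still present ends a sentence.
    for mark in ".?!":
        text = text.replace(mark, mark + "<stop>")

    # Restore the protected periods, then split on the sentence markers.
    text = text.replace("<prd>", ".")
    return [s.strip() for s in text.split("<stop>")[:-1]]


sample = "Mail me at bob@mail.example.com today. See docs.python.org for more."
print(split_into_sentences(sample))
```

With the sample text above, both the email address and the multi-part host name stay intact, and the text splits into two sentences. A named function is used instead of a lambda only for readability; the behaviour is the same.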
Wiktor Stribiżew