Using regular expression as a tokenizer?

Question

I am trying to tokenize my corpus into sentences. I tried spaCy and NLTK, but they did not work well because my text is a bit tricky. Below is an artificial sample I made that covers all the edge cases I know of:

It is relevant to point that Case No. 778 - Martin H. v. The Woods, it was mentioned that death
 to one cannot be generalised. However, the High Court while enhancing the same from life to 
death, in our view,has not assigned adequate and acceptable reasons. In our opinion, it is not a 
rarest of rare case where extreme penalty of death is called for instead sentence of 
imprisonment for life as ordered by the trial Court would be appropriate.15) In the light of the 
above discussion, while maintaining the conviction of the appellant-accused for the offence under Section 302. IPC, 
award of extreme penalty of death by the High Court is set aside and we restore the sentence of
 life imprisonment as directed by the trial Court.

How I would like the sentence to be tokenized:

1) It is relevant to point that Case No. 778 - Martin H. v. The Woods, it was mentioned that death to one cannot be generalised.
2) However, the High Court while enhancing the same from life to death, in our view,has not assigned adequate and acceptable reasons.
3) In our opinion, it is not a rarest of rare case where extreme penalty of death is called for instead sentence of imprisonment for life as ordered by the trial Court would be appropriate.
4)15. In the light of the above discussion, while
 maintaining the conviction of the appellant-accused for the offence under Section 302. IPC, 
award of extreme penalty of death by the High Court is set aside and we restore the sentence of
 life imprisonment as directed by the trial Court.

Here is the regular expression I am using:

sent = re.split('(?<!\w\.\w.)(?<![A-Z]\.)(?<![1-9]\.)(?<![1-9]\.)(?<![v]\.)(?<![vs]\.)(?<=\.|\?) ',j)

I am not well versed in regular expressions, so I am adding conditions manually, for example for v and vs. I am also ignoring the period if there is a number before it, for example 15.

Problems I am facing:

If there is no gap between two sentences, it does not split properly.
I would also like it to ignore the period if the word before it is capitalized, for example No. or Mr.

DarrylG · Accepted Answer · 2020-09-18T06:01:08.733

In general, you can't rely on one single Great White infallible regex; you have to write a function that uses several regexes (both positive and negative), plus a dictionary of abbreviations and some basic language parsing that knows, e.g., that 'I', 'USA', 'FCC', and 'TARP' are capitalized in English. (Reference)
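A minimal sketch of that guideline, assuming a small hand-maintained abbreviation set: periods after known abbreviations are swapped for a placeholder, the remaining terminators are marked as sentence ends, and the placeholder is swapped back at the end. `ABBREVIATIONS`, `protect_abbreviations`, and `naive_split` are hypothetical names used only for illustration.

```python
import re

# Hypothetical abbreviation list; extend it for your own corpus.
ABBREVIATIONS = {"No", "Mr", "Mrs", "Dr", "Ex", "v", "vs"}

def protect_abbreviations(text):
    """Replace the period after a known abbreviation with <prd>."""
    def repl(match):
        word = match.group(1)
        # Leave ordinary words untouched so their period still ends a sentence.
        return word + "<prd>" if word in ABBREVIATIONS else match.group(0)
    return re.sub(r"\b([A-Za-z]+)\.", repl, text)

def naive_split(text):
    """Protect abbreviations, mark real terminators, split, restore periods."""
    protected = protect_abbreviations(text)
    marked = re.sub(r"([.?!])", r"\1<stop>", protected)
    return [s.replace("<prd>", ".").strip()
            for s in marked.split("<stop>") if s.strip()]

print(naive_split("Mr. Smith cited Case No. 778.He lost. See Ex. P for details."))
```

This sketch would still split decimals such as 3.5 and any abbreviation missing from the set, which is why a fuller solution layers several rules on top of one another.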

Following this guideline, the function below uses several regexes to split your text into sentences. It is a modification of D Greenberg's answer.

Code

import re

def split_into_sentences(text):
    # Regex pattern
    alphabets = r"([A-Za-z])"
    prefixes = r"(Mr|St|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|Mt)[.]"
    suffixes = r"(Inc|Ltd|Jr|Sr|Co)"
    starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
    acronyms = r"([A-Z][.][A-Z][.](?:[A-Z][.])?)"
    # website regex from https://www.geeksforgeeks.org/python-check-url-string/
    websites = r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))"
    digits = r"([0-9])"
    section = r"(Section \d+)([.])(?= \w)"
    item_number = r"(^|\s\w{2})([.])(?=[-+ ]?\d+)"
    abbreviations = r"(^|[\s\(\[]\w{1,2}s?)([.])(?=[\s\)\]]|$)"
    parenthesized = r"\((.*?)\)"
    bracketed = r"\[(.*?)\]"
    curly_bracketed = r"\{(.*?)\}"
    enclosed = '|'.join([parenthesized, bracketed, curly_bracketed])
    # text replacement
    # replace unwanted stop period with <prd>
    # actual stop periods with <stop>
    text = " " + text + "  "
    text = text.replace("\n"," ")
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites, lambda m: m.group().replace('.', '<prd>'), text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    if "..." in text: text = text.replace("...","<prd><prd><prd>")
    text = re.sub(r"\s" + alphabets + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + alphabets + "[.]"," \\1<prd>",text)
    text = re.sub(section,"\\1<prd>",text)
    text = re.sub(item_number,"\\1<prd>",text)
    text = re.sub(abbreviations, "\\1<prd>",text)
    text = re.sub(digits + "[.]" + digits,"\\1<prd>\\2",text)
    text = re.sub(enclosed, lambda m: m.group().replace('.', '<prd>'), text)
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")

    # Tokenize sentence based upon <stop>
    sentences = text.split("<stop>")
    if sentences[-1].isspace():
        # remove last since only whitespace
        sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]

    return sentences

Usage

for index, token in enumerate(split_into_sentences(s), start = 1):
    print(f'{index}) {token}')

Tests

1. Input

s='''It is relevant to point that Case No. 778 - Martin H. v. The Woods, it was mentioned that death
 to one cannot be generalised. However, the High Court while enhancing the same from life to 
death, in our view,has not assigned adequate and acceptable reasons. In our opinion, it is not a 
rarest of rare case where extreme penalty of death is called for instead sentence of 
imprisonment for life as ordered by the trial Court would be appropriate.15) In the light of the 
above discussion, while maintaining the conviction of the appellant-accused for the offence under Section 302. IPC, 
award of extreme penalty of death by the High Court is set aside and we restore the sentence of
 life imprisonment as directed by the trial Court.
'''

Output

1) It is relevant to point that Case No. 778 - Martin H. v. The Woods, it was mentioned that death  to one cannot be generalised.
2) However, the High Court while enhancing the same from life to  death, in our view,has not assigned adequate and acceptable reasons.
3) In our opinion, it is not a  rarest of rare case where extreme penalty of death is called for instead sentence of  imprisonment for life as ordered by the trial Court would be appropriate.
4) 15) In the light of the  above discussion, while maintaining the conviction of the appellant-accused for the offence under Section 302. IPC,  award of extreme penalty of death by the High Court is set aside and we restore the sentence of  life imprisonment as directed by the trial Court.

2. Input

s = '''Mr. or Mrs. or Dr. (not sure of their title) Smith will be here in the morning at eight.He's arriving on flight No. 48213 out of Denver.He'll take the No. 2 bus from the airport.However, he may grab a taxi instead.'''

Output

1) Mr. or Mrs. or Dr. (not sure of their title) Smith will be here in the morning at eight.
2) He's arriving on flight No. 48213 out of Denver.
3) He'll take the No. 2 bus from the airport.
4) However, he may grab a taxi instead.

3. Input

s = '''The respondent, in his statement Ex.-73, which is accepted and found to be truthful. The passcode is either No.5, No. 5, No.-5, No.+5.'''

Output

1) The respondent, in his statement Ex.-73, which is accepted and found to be truthful.
2) The passcode is either No.5, No. 5, No.-5, No.+5.

4. Input

s = '''He went to New York. He is 10 years old.'''

Output

1) He went to New York.
2) He is 10 years old.

5. Input

s = '''15) In the light of  Ex. P the above discussion, while maintaining the conviction of the appellant-accused for the offence under Section 302 IPC, award of extreme penalty of death by the High Court is set aside and we restore the sentence of life imprisonment as directed by the trial Court. The appeal is allowed in part to the extent mentioned above.'''

Output

1) 15) In the light of  Ex. P the above discussion, while maintaining the conviction of the appellant-accused for the offence under Section 302 IPC, award of extreme penalty of death by the High Court is set aside and we restore the sentence of life imprisonment as directed by the trial Court.
2) The appeal is allowed in part to the extent mentioned above.

This works really well except for one thing. How do I get it to ignore when the period is before a word that is capitalized? Any idea how to do that? @DarrylG — Shawn, Sep 15 '20 at 08:09
@Shawn--the updated answer shows a solution to your intended problem cases. Is this what you expect? — DarrylG, Sep 15 '20 at 08:50
Hi Darryl, I am afraid not. What I want is to handle a capitalized word before a period, for example ```Eg.``` or ```No.```. The thing is, I can't list them manually because I see a new one every time. Here's a sentence that's giving trouble: ```The respondent, in his statement Ex.-73, which is accepted and found to be truthful.``` This is a single sentence but gets split into two. How do I stop that? — Shawn, Sep 15 '20 at 12:28
I am okay if this causes sentences like this to split the wrong way: ```He went to New York. He is 10 years old.``` — Shawn, Sep 15 '20 at 12:30
Hi Darryl! Not sure what's happening but some sentences keep getting cut awkwardly. For example the sentence below gets abruptly cut at ```Ex.```. ```15) In the light of Ex. P the above discussion, while maintaining the conviction of the appellant-accused for the offence under Section 302 IPC, award of extreme penalty of death by the High Court is set aside and we restore the sentence of life imprisonment as directed by the trial Court. The appeal is allowed in part to the extent mentioned above.``` — Shawn, Sep 16 '20 at 13:05
Hi Darryl, it keeps breaking in other scenarios. I think the problem is that I have only been posting snippets. So here is a pastebin: https://pastebin.com/y0XC3wig. I would love it if you could go through it, and thanks a ton for the many iterations you have already done! — Shawn, Sep 17 '20 at 13:10
Here's a case that is currently failing: ```At the same time, spot Panchanama (Ex. 24) was drawn by PW-14 and he also seized the articles found lying there including wooden rafter having stains of blood and a big stone. Since the condition of.``` — Shawn, Sep 17 '20 at 13:26
@Shawn--"Since the condition of."-- Isn't that a bad sentence? I'll take a look though. — DarrylG, Sep 17 '20 at 13:27
@DarrylG sorry, my bad. Here is the correct one - ```At the same time, spot Panchanama (Ex. 24) was drawn by PW-14 and he also seized the articles found lying there including wooden rafter having stains of blood and a big stone.``` — Shawn, Sep 17 '20 at 13:32
@Shawn--that last text was easily fixed but will try later today to create a version that handles the text in your link. — DarrylG, Sep 17 '20 at 15:10
@Shawn--updated code. Seems to work well against code in your pastebin link. I'll be interested in your test? — DarrylG, Sep 18 '20 at 06:04
Hey Darryl! Just checked and it works perfectly. Thanks a ton for taking so much of your time to help a random stranger! Cheers! — Shawn, Sep 18 '20 at 17:39
@Shawn--glad I could help. I looked upon it as a fun challenge. — DarrylG, Sep 19 '20 at 08:20

score 0 · Answer 2 · answered Sep 13 '20 at 14:09

Are you looking for the regex below?

'(?<=[^A-Z][a-z]\w)[/.] '

Explanation:

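[^A-Z][a-z]\w --> This matches three characters: one that is not an uppercase letter, then a lowercase letter, then a word character. Short capitalized abbreviations such as 'No' or 'Mr' fail this test because the capital letter falls in the [a-z] position.
(?<=....) --> This is a positive lookbehind. It requires the text just before the current position to match, but does not make that text part of the match.
[/.] followed by a space --> This is the only part that is actually matched, i.e., '. ' (or '/ '), and it is what the split removes.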

Now this can be used in split:

sent = re.split(r'(?<=[^A-Z][a-z]\w)[/.] ', j)
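As a quick check of how this lookbehind behaves, here is a small sketch on a made-up sentence (the sample text is an assumption, not taken from the question's corpus). It shows that '. ' after 'No' and 'Mr' is left alone, while '. ' after an ordinary lowercase word triggers a split.

```python
import re

# Split on ". " (or "/ ") only when the three preceding characters are
# a non-uppercase character, a lowercase letter, and a word character.
pattern = re.compile(r"(?<=[^A-Z][a-z]\w)[/.] ")

text = "See Case No. 778 today. Mr. Smith agreed. It was fine."
sentences = pattern.split(text)
print(sentences)
# → ['See Case No. 778 today', 'Mr. Smith agreed', 'It was fine.']
```

Note that the split consumes the period along with the space, so every sentence except the last loses its final '.'. The pattern also requires a space, so text with no gap between sentences (such as "eight.He's") is not split.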
