Regular Expressions with Python

Question

The question is:

Write a script that reads a text file, splits it into sentences, and prints the sentences on the screen, one after the other. Do not use libraries that do sentence-splitting for you.

The following is my code:

import re
fr=open('input.txt')
text=fr.read().strip()
fr.close()
Ms=re.finditer(' +([A-Z].+?\.) ',text)
for i in Ms:
    print i.group(1)

The script prints nothing. I suspect part of the problem is that the first sentence of the file has no leading spaces, but I can't figure out how to fix it.

The following is my input:

Metformin will reach full effectiveness in 6-8 weeks. It has three primary effects (http://en.wikipedia.org/wiki/Metformin#Mechanism4of_action).

First, it (frequently) reduces the amount of blood sugar produced by your liver, this presumably will decrease your basal needs and help your fasting numbers.

Second, metformin increases the insulin, signaling resulting in increased insulin sensitivity: http://care.diabetesjournals.org/content/27/1/281.full. The effect is primarily on the muscle mass in your body. Insulin resistance also affects all kinds of other stuff, but the biggest utilization of insulin is in the uptake of glucose to muscles.

Third, Metformin decreases the absorption of glucose during digestion.It is this effect that I believe causes some of the gastric issues.

Have you done any debugging? With what results? What *is* in the file? — jonrsharpe, Feb 26 '16 at 21:43
runfile('C:/Users/Air/Desktop/660/week6/assignment.py', wdir='C:/Users/Air/Desktop/660/week6') Just nothing shows up. — Zhongyang Sheng, Feb 26 '16 at 21:45
That's just running it, not debugging it. Please give a [mcve]. — jonrsharpe, Feb 26 '16 at 21:45
You stripped the contents you read with `.strip()`. That is why there is no whitespace at the start. — Wiktor Stribiżew, Feb 26 '16 at 21:50
Please clarify how sentences might end. Only with `.`? Are `!` and `?` and `"` possible? Is there perhaps a `Dr.` in it? — gil, Feb 26 '16 at 21:51
Possible duplicate of [Python - RegEx for splitting text into sentences (sentence-tokenizing)](http://stackoverflow.com/questions/25735644/python-regex-for-splitting-text-into-sentences-sentence-tokenizing) — DNA, Feb 26 '16 at 23:10

DNA · Answer 1 · 2016-02-26T23:02:43.073

It's difficult to comment without seeing your input, but note that you need to be careful about leading and trailing spaces. In the example below, the first word is missed because it has no leading space, and the second sentence will be missed if you require a trailing space.

>>> import re
>>> text = "See Spot run. Run, Spot, run."
>>> re.findall(r' +([A-Z].+?\.)', text)
['Spot run.', 'Run, Spot, run.']
>>> re.findall(r' +([A-Z].+?\.) ', text)
['Spot run.']

We can do slightly better with character classes, but you need to decide exactly how sentences are demarcated.

>>> re.findall(r'([\w, ]+\.)', text)
['See Spot run.', ' Run, Spot, run.']
>>> re.findall(r'[^.]+\.', text)
['See Spot run.', ' Run, Spot, run.']

But splitting on a period will fail in many cases, such as the URL in your example input, or the following:

>>> re.findall(r'[^.]+\.', "See Dr. Spock run. Run, Spock, run.")
['See Dr.', ' Spock run.', ' Run, Spock, run.']
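One possible way to cope with abbreviations such as `Dr.` is sketched below. It splits on whitespace that follows a period, but uses negative lookbehinds to skip periods that end a known abbreviation. The `ABBREVIATIONS` list and the `split_sentences` name are hypothetical and chosen only for illustration; this is not a complete sentence tokenizer, and it still assumes every sentence ends with `.` followed by whitespace.

```python
import re

# Hypothetical abbreviation list for illustration only; real text needs a longer one.
ABBREVIATIONS = ["Dr", "Mr", "Mrs", "Ms", "Prof"]

# One fixed-width negative lookbehind per abbreviation, e.g. (?<!\bDr\.)
_NOT_ABBREVIATION = "".join(r"(?<!\b%s\.)" % re.escape(a) for a in ABBREVIATIONS)

# Split on whitespace that follows a period, unless that period ends an abbreviation.
_BOUNDARY = re.compile(r"(?<=\.)" + _NOT_ABBREVIATION + r"\s+")


def split_sentences(text):
    """Return the sentences in text as a list of strings."""
    return [s for s in _BOUNDARY.split(text.strip()) if s]


print(split_sentences("See Dr. Spock run. Run, Spock, run."))
# ['See Dr. Spock run.', 'Run, Spock, run.']
```

Each abbreviation gets its own lookbehind because Python's `re` module requires lookbehind patterns to have a fixed width.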

Thank you, sir. The content of my input has been added on the screen. — Zhongyang Sheng, Feb 26 '16 at 22:57

score 1 · Accepted Answer · edited Jun 20 '20 at 09:12

Assume file input.txt has the following content:

Metformin will reach full effectiveness in 6-8 weeks. It has three primary effects (http://en.wikipedia.org/wiki/Metformin#Mechanism4of_action).

First, it (frequently) reduces the amount of blood sugar produced by your liver, this presumably will decrease your basal needs and help your fasting numbers.

Second, metformin increases the insulin, signaling resulting in increased insulin sensitivity: http://care.diabetesjournals.org/content/27/1/281.full. The effect is primarily on the muscle mass in your body. Insulin resistance also affects all kinds of other stuff, but the biggest utilization of insulin is in the uptake of glucose to muscles.

Third, Metformin decreases the absorption of glucose during digestion.It is this effect that I believe causes some of the gastric issues.

Here is the code:

import re

with open('input.txt', 'r') as f:
    fin = f.read()

# Replace each period plus the whitespace after it with a period and a newline.
print(re.sub(r'\.\s+', '.\n', fin))

Output:

Metformin will reach full effectiveness in 6-8 weeks.
It has three primary effects (http://en.wikipedia.org/wiki/Metformin#Mechanism4of_action).
First, it (frequently) reduces the amount of blood sugar produced by your liver, this presumably will decrease your basal needs and help your fasting numbers.
Second, metformin increases the insulin, signaling resulting in increased insulin sensitivity: http://care.diabetesjournals.org/content/27/1/281.full.
The effect is primarily on the muscle mass in your body.
Insulin resistance also affects all kinds of other stuff, but the biggest utilization of insulin is in the uptake of glucose to muscles.
Third, Metformin decreases the absorption of glucose during digestion.It is this effect that I believe causes some of the gastric issues.

One sentence is not split correctly because the source text is missing a space between two sentences; that should be fixed in the text file.

UPDATE: That said, the following works with the text file unchanged, because it also splits where a period is directly followed by a capital letter:

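import re

with open('input.txt', 'r') as f:
    fin = f.read()

# Break after a period followed by optional whitespace and a capital letter,
# keeping the capital letter (group 1) at the start of the next line.
print(re.sub(r'\.\s*([A-Z])', r'.\n\g<1>', fin))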
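If a list of sentences is more convenient than printed output, the same substitution can be wrapped in a small function, roughly as sketched below. The `split_sentences` name is hypothetical, and the sample string is taken from the input above. The sketch assumes, as the pattern does, that a sentence boundary is a period followed by optional whitespace and a capital letter.

```python
import re


def split_sentences(text):
    """Return sentences found by breaking after '.' when a capital letter follows."""
    # Same substitution as above: mark each boundary with a newline.
    marked = re.sub(r'\.\s*([A-Z])', r'.\n\g<1>', text)
    # Split on the newlines and drop empty lines left by paragraph breaks.
    return [line.strip() for line in marked.splitlines() if line.strip()]


sample = ("Third, Metformin decreases the absorption of glucose during "
          "digestion.It is this effect that I believe causes some of the "
          "gastric issues.")
for sentence in split_sentences(sample):
    print(sentence)
```

Because the pattern only looks for a capital letter after the period, it would still split after abbreviations such as `Dr.` and inside any URL where a period is followed by a capital letter.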

There is still a problem with the last two sentences: because there is no space between them, they can't be split apart. — Zhongyang Sheng, Feb 27 '16 at 01:36

score 0 · Answer 3 · answered Feb 26 '16 at 21:45

Try this:

import re

with open('input.txt', 'r') as f:
    data = f.read()

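# Note: this splits on line breaks only, so it prints the file line by line,
# not sentence by sentence. (The original passed re.M as the third positional
# argument, which re.split interprets as maxsplit, not as a flag.)
print('\n'.join(re.split(r'\n', data)))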

MaxU - stand with Ukraine
