The question is:

Write a script that reads a text file, splits it into sentences, and prints the sentences on the screen, one after the other. Do not use libraries that do sentence-splitting for you.

The following is my code:

import re
fr=open('input.txt')
text=fr.read().strip()
fr.close()
Ms=re.finditer(' +([A-Z].+?\.) ',text)
for i in Ms:
    print i.group(1)

The result shows nothing. I suspect the problem is that the first sentence of the file has no leading spaces, so the pattern cannot match it, but I can't figure out how to fix it.

The following is my input:

Metformin will reach full effectiveness in 6-8 weeks. It has three primary effects (http://en.wikipedia.org/wiki/Metformin#Mechanism4of_action).

First, it (frequently) reduces the amount of blood sugar produced by your liver, this presumably will decrease your basal needs and help your fasting numbers.

Second, metformin increases the insulin, signaling resulting in increased insulin sensitivity: http://care.diabetesjournals.org/content/27/1/281.full. The effect is primarily on the muscle mass in your body. Insulin resistance also affects all kinds of other stuff, but the biggest utilization of insulin is in the uptake of glucose to muscles.

Third, Metformin decreases the absorption of glucose during digestion.It is this effect that I believe causes some of the gastric issues.

3 Answers

It's difficult to comment without seeing your input, but note that you need to be careful about leading and trailing spaces. In the example below, the first word is missed because it has no leading space, and the second sentence will be missed if you require a trailing space.

>>> text = "See Spot run. Run, Spot, run."

>>> re.findall(' +([A-Z].+?\.)',text)

['Spot run.', 'Run, Spot, run.']

>>> re.findall(' +([A-Z].+?\.) ',text)

['Spot run.']

We can do slightly better with character classes, but you need to decide exactly how sentences are demarcated.

>>> re.findall('([\w, ]+\.)',text)

['See Spot run.', ' Run, Spot, run.']

>>> re.findall('[^.]+\.',text)

['See Spot run.', ' Run, Spot, run.']

But splitting on a period will fail in many cases, such as the URL in your example input, or the following:

>>> re.findall('[^.]+\.',"See Dr. Spock run. Run, Spock, run.")

['See Dr.', ' Spock run.', ' Run, Spock, run.']
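
One way to reduce such false splits is to treat a terminator as a boundary only when it is followed by whitespace and a capital letter, and to skip a few known abbreviations. The following is a minimal sketch under that assumption; the names `ABBREVIATIONS` and `split_sentences` are hypothetical, and the abbreviation list is deliberately tiny and far from complete.

```python
import re

# Hypothetical, deliberately tiny abbreviation list; extend it for real text.
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "Ms.", "Prof."}

def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace and a capital."""
    sentences = []
    start = 0
    # Candidate boundaries: terminator, whitespace, then an uppercase letter.
    for match in re.finditer(r'[.!?]\s+(?=[A-Z])', text):
        candidate = text[start:match.start() + 1]
        last_word = candidate.split()[-1]
        if last_word in ABBREVIATIONS:
            continue  # e.g. "Dr. Spock": not a real boundary, keep scanning
        sentences.append(candidate.strip())
        start = match.end()
    # Whatever follows the last boundary is the final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences("See Dr. Spock run. Run, Spock, run."))
```

Because periods inside a URL are not followed by whitespace, they are left alone. Sentences that are missing the space after the period (as in "digestion.It") are not split by this rule.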
DNA

Assume file input.txt has the following content:

Metformin will reach full effectiveness in 6-8 weeks. It has three primary effects (http://en.wikipedia.org/wiki/Metformin#Mechanism4of_action).

First, it (frequently) reduces the amount of blood sugar produced by your liver, this presumably will decrease your basal needs and help your fasting numbers.

Second, metformin increases the insulin, signaling resulting in increased insulin sensitivity: http://care.diabetesjournals.org/content/27/1/281.full. The effect is primarily on the muscle mass in your body. Insulin resistance also affects all kinds of other stuff, but the biggest utilization of insulin is in the uptake of glucose to muscles.

Third, Metformin decreases the absorption of glucose during digestion.It is this effect that I believe causes some of the gastric issues.

Here is the code:

import re
# Read the whole file, then put a newline after every sentence-ending period.
with open('input.txt', 'r') as f:
    fin = f.read()
print(re.sub(r'\.\s+', '.\n', fin))

Output:

Metformin will reach full effectiveness in 6-8 weeks.
It has three primary effects (http://en.wikipedia.org/wiki/Metformin#Mechanism4of_action).
First, it (frequently) reduces the amount of blood sugar produced by your liver, this presumably will decrease your basal needs and help your fasting numbers.
Second, metformin increases the insulin, signaling resulting in increased insulin sensitivity: http://care.diabetesjournals.org/content/27/1/281.full.
The effect is primarily on the muscle mass in your body.
Insulin resistance also affects all kinds of other stuff, but the biggest utilization of insulin is in the uptake of glucose to muscles.
Third, Metformin decreases the absorption of glucose during digestion.It is this effect that I believe causes some of the gastric issues.

One sentence is not split correctly because the input is missing a space between two sentences ("digestion.It"). Ideally that should be fixed in the text file.

UPDATE: That said, the following handles the text file unchanged:

import re
# A period, optional whitespace, then a capital letter marks a sentence boundary.
with open('input.txt', 'r') as f:
    fin = f.read()
print(re.sub(r'\.\s*([A-Z])', r'.\n\g<1>', fin))
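
To check what this pattern does on the problematic paragraph without touching the file, here is a small sketch that applies the same substitution to an in-memory string. The variable names `sample` and `lines` are only for illustration.

```python
import re

# Last paragraph of the input, with the missing space after "digestion."
sample = ("Third, Metformin decreases the absorption of glucose during "
          "digestion.It is this effect that I believe causes some of the "
          "gastric issues.")

# A period, optional whitespace, then a capital letter marks a boundary.
result = re.sub(r'\.\s*([A-Z])', r'.\n\g<1>', sample)
lines = result.split('\n')

for line in lines:
    print(line)
```

Note that this rule splits after any period that is followed by a capital letter, so abbreviations such as "Dr. Spock" would also be broken across two lines.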
Quinn

Try this:

import re

with open('input.txt', 'r') as f:
    data = f.read()

# Pass flags by keyword: the third positional argument of re.split is maxsplit, not flags.
print('\n'.join(re.split(r'\n', data, flags=re.M)))
MaxU - stand with Ukraine