This is for a school programming project, and I'm supposed to use only the re module.

I am trying to find all sentences in a text file that contain a certain expression, passed in as a parameter, and extract them into a list. Searching other posts got me halfway there by finding the dots that start and end the sentence, but if the sentence contains a number with a dot in it, the result is cut off at that dot.

Suppose txt is: This is a text. I dont want for the result to stop in the number 990.576, I want to extract the phrase with this expression. Its not working.

search = re.findall(r"([^.]*?" + expression + r"[^.]*)\.", txt)

The result I'm getting is ['576, I want to extract the phrase with this expression',]

The result I want is ['I dont want for the result to stop in the number 990.576, I want to extract the phrase with this expression.']

I'm still a beginner at this. Any help?

  • First search for a dot between numbers, replace this by a comma. Then split your text and in the resulting phrases, look again for the numbers with the comma and replace that comma back by a dot. – Dominique Nov 23 '18 at 11:08

3 Answers

If I understand correctly, you want to split the text into sentences. One regex for that is:

sentences = re.split(r' *[\.\?!][\'"\)\]]* *', txt)

If that doesn't work, you can first replace the decimal points inside numbers with commas using this regex:

txt = re.sub(r'(\d*)\.(\d+)', r'\1,\2', txt)
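
The replace-then-split idea can be sketched end to end as follows. This is only a sketch under some assumptions: the helper name `sentences_containing` and the `PLACEHOLDER` character are hypothetical, the placeholder is assumed never to occur in the text, and lookarounds are used instead of the capture groups above so that only the dot itself is replaced. Note that the split pattern on its own still breaks at the dot in 990.576, so the substitution has to run first.

```python
import re

# Hypothetical placeholder; assumed not to appear anywhere in the text.
PLACEHOLDER = "\x00"

def sentences_containing(txt, expression):
    # Hide dots that sit between two digits, e.g. the one in 990.576.
    protected = re.sub(r"(?<=\d)\.(?=\d)", PLACEHOLDER, txt)
    # A sentence here is a run of non-terminators followed by one terminator.
    candidates = re.findall(r"[^.!?]+[.!?]", protected)
    # Put the decimal dots back before filtering, so the expression
    # itself may contain a number such as 990.576.
    restored = [s.replace(PLACEHOLDER, ".").strip() for s in candidates]
    return [s for s in restored if expression in s]

txt = ("This is a text. I dont want for the result to stop in the number "
       "990.576, I want to extract the phrase with this expression. "
       "Its not working.")
print(sentences_containing(txt, "expression"))
```

Using a placeholder rather than a comma avoids confusing restored decimal points with commas that were already in the text.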
gocen

Tokenize the text into sentences with NLTK, and then use a whole word search or a regular substring check.

Example with a whole word search:

import nltk, re
text = "This is a text. I dont want for the result to stop in the number 990.576, I want to extract the phrase with this expression. Its not working."
sentences = nltk.sent_tokenize(text)
word = "expression"
print([sent for sent in sentences if re.search(r'\b{}\b'.format(word), sent)])
# => ['I dont want for the result to stop in the number 990.576, I want to extract the phrase with this expression.']

If you do not need a whole word search, replace if re.search(r'\b{}\b'.format(word), sent) with if word in sent.
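
One caveat with the whole word search: the word is interpolated into the pattern as-is, so characters such as "." or "+" in it would be read as regex syntax. A minimal sketch that guards against this with re.escape is shown below; the helper name `contains_whole_word` is hypothetical, and the sentence list is written out by hand so the snippet runs without NLTK.

```python
import re

def contains_whole_word(sentence, word):
    # re.escape makes characters like "." or "+" in word match literally.
    pattern = r"\b{}\b".format(re.escape(word))
    return re.search(pattern, sentence) is not None

# Sentences as a tokenizer would be expected to return them for the sample text.
sentences = [
    "This is a text.",
    "I dont want for the result to stop in the number 990.576, I want to extract the phrase with this expression.",
    "Its not working.",
]
matches = [s for s in sentences if contains_whole_word(s, "expression")]
print(matches)
```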

Wiktor Stribiżew

Maybe not the best solution, but you can split the text into sentences first and then look for the expression in each one, like this:

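import re

text = "This is a text. I dont want for the result to stop in the number 990.576, I want to extract the phrase with this expression. Its not working."

# Split on whitespace that follows "." or "?", skipping abbreviations such as "e.g." or "Mr."
sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)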

matching = [s for s in sentences if "I want to extract the phrase with this expression" in s]

print(matching)

#Result:
# ['I dont want for the result to stop in the number 990.576, I want to extract the phrase with this expression.']
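
For comparison, the single findall call from the question can probably be adapted instead of splitting first. The sketch below is one possible variant, not the approach above: it lets the sentence body contain a dot only when that dot has a digit on both sides. The names `BODY` and `find_sentences` are hypothetical, and only "." is treated as a sentence terminator, as in the question.

```python
import re

# One character of a sentence body: any non-dot character,
# or a dot with a digit on both sides (as in 990.576).
BODY = r"(?:[^.]|(?<=\d)\.(?=\d))"

def find_sentences(txt, expression):
    # re.escape keeps special characters in the expression literal.
    pattern = BODY + "*?" + re.escape(expression) + BODY + r"*\."
    # Matches start right after the previous sentence's dot,
    # so strip the leading space.
    return [m.strip() for m in re.findall(pattern, txt)]

txt = ("This is a text. I dont want for the result to stop in the number "
       "990.576, I want to extract the phrase with this expression. "
       "Its not working.")
print(find_sentences(txt, "expression"))
```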

Hope it helps!

Alejandro Barone