How to get sentences from a paragraph with custom list of words in Python

Question

I am trying to read a paragraph and capture all the sentences in it with words matching a dynamic list of words.

The Python pre-processing steps will identify the list of words. I want to use this list to find the sentences in the paragraph that contain at least one of the words. All of the identified sentences will be appended to a new variable.

Input: "Machine learning is the science of getting computers to act without being explicitly programmed. Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it. Many researchers also think it is the best way to make progress towards human-level AI."

list of words: computer, researcher

Output: Machine learning is the science of getting computers to act without being explicitly programmed. Many researchers also think it is the best way to make progress towards human-level AI.

What is the best way to accomplish this?

What is your way to accomplish this? Please show your attempt — styvane, Jul 14 '15 at 16:02
In future it is best to provide an example of what you have tried so we understand more about what you want - it also shows that you don't simply want us to do your work for you. — NDevox, Jul 14 '15 at 16:07
The point of the exercise is to improve yourself, you're missing out if you get someone else to do the thinking for you. — Peter Wood, Jul 14 '15 at 16:08

Avinash Raj · Answer 1 · 2015-07-14T16:11:31.437

0

Give this a try, where s is the input paragraph:

import re

lst = ['computer', 'researcher']
''.join(re.findall(r'[^.]*(?:' + '|'.join(lst) + r')[^.]*\.', s))

This fails if decimal numbers are present, because their periods are treated as sentence ends. If you also want to deal with those, try this:

''.join(re.findall(r'(?:\d\.\d|[^.])*(?:' + '|'.join(lst) + r')(?:\d\.\d|[^.])*\.', s))
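For reference, here is a minimal, self-contained sketch of the first pattern wrapped in a function. The function name sentences_matching and the use of re.escape are assumptions added for illustration and are not part of the answer; the sketch also returns a list instead of a joined string so the result can be stored in a new variable.

```python
import re


def sentences_matching(text, words):
    """Return the period-terminated sentences of text that contain any word."""
    if not words:
        return []
    # re.escape keeps words such as "c++" from being read as regex syntax.
    alternation = "|".join(re.escape(word) for word in words)
    # Same shape as the first pattern: a run of non-period characters that
    # contains one of the words, followed by the closing period.
    pattern = r"[^.]*(?:" + alternation + r")[^.]*\."
    return [match.strip() for match in re.findall(pattern, text)]


paragraph = (
    "Machine learning is the science of getting computers to act "
    "without being explicitly programmed. Machine learning is so "
    "pervasive today that you probably use it dozens of times a day "
    "without knowing it. Many researchers also think it is the best "
    "way to make progress towards human-level AI."
)
selected = sentences_matching(paragraph, ["computer", "researcher"])
print(selected)
```

Like the original one-liner, this treats every period as a sentence end, so it inherits the decimal-number limitation.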

edited Jul 14 '15 at 16:11

answered Jul 14 '15 at 16:05

Avinash Raj


score 0 · Answer 2 · answered Jul 14 '15 at 16:06

Don't use regex for non-regular patterns: such expressions are difficult to understand, and often verbose and inefficient.

What you want can easily be done with the following:

x = "Machine learning is the science of getting computers to act without being explicitly programmed. Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it. Many researchers also think it is the best way to make progress towards human-level AI."

l = ['computer', 'researcher']

for line in x.split('.'):
    for word in l:
        if word in line:
            print(line)
            break

output:

Machine learning is the science of getting computers to act without being explicitly programmed
 Many researchers also think it is the best way to make progress towards human-level AI
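Since the question asks for the matches to be collected in a new variable, the loop above could be adapted roughly as follows. This is only a sketch: the function name matching_sentences is hypothetical, and stripping whitespace and re-adding the period are assumptions about the desired output format.

```python
def matching_sentences(paragraph, words):
    """Collect the sentences of paragraph that contain any of the words."""
    matches = []
    for chunk in paragraph.split('.'):
        sentence = chunk.strip()  # drop the space left over after each period
        # any() stops at the first matching word, like the break in the loop.
        if sentence and any(word in sentence for word in words):
            matches.append(sentence + '.')  # restore the period removed by split
    return matches


x = (
    "Machine learning is the science of getting computers to act "
    "without being explicitly programmed. Machine learning is so "
    "pervasive today that you probably use it dozens of times a day "
    "without knowing it. Many researchers also think it is the best "
    "way to make progress towards human-level AI."
)
result = matching_sentences(x, ['computer', 'researcher'])
print(' '.join(result))
```

Note that split('.') also breaks on periods inside decimal numbers and abbreviations, so this only suits simple text.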

score 0 · Accepted Answer · edited May 23 '17 at 11:43

Based partially on this answer:

import nltk

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
text = "Machine learning is the science of getting computers to act without being explicitly programmed. Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it. Many researchers also think it is the best way to make progress towards human-level AI."
word_list = ['computer', 'researcher']
output_list = []

for sentence in tokenizer.tokenize(text):
    for word in word_list:
        if word in sentence:
            output_list.append(sentence)
            break # useful when word_list is large

You need to run nltk.download() beforehand and download punkt in the Models tab.
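One caveat: word in sentence is a case-sensitive substring test, so 'computer' also matches inside 'supercomputer' and misses 'Computer'. If that matters, a variant along these lines might help. It is a sketch, not part of the answer: filter_sentences is a hypothetical helper that takes an already tokenized list of sentences (for example the result of tokenizer.tokenize(text)) and matches each listed word only at the start of a word, ignoring case.

```python
import re


def filter_sentences(sentences, words):
    """Keep sentences containing a word that starts with one of the words."""
    if not words:
        return []
    # \b anchors the match at a word start; \w* allows endings such as the
    # plural "s", so "computer" matches "computers" but not "supercomputer".
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(word) for word in words) + r")\w*",
        re.IGNORECASE,
    )
    return [sentence for sentence in sentences if pattern.search(sentence)]


sample = [
    "Computers are everywhere.",
    "A supercomputer filled the room.",
    "Many researchers disagree.",
]
kept = filter_sentences(sample, ["computer", "researcher"])
print(kept)
```

Whether prefix matching is the right rule depends on the word list; a stemmer or lemmatizer would be a more thorough option.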

This is exactly what I was looking for. I was already using nltk and the tokenizer in previous steps but didn't know there was a sentence function. Thanks fenceop! — Jton, Jul 14 '15 at 17:10
