Python - Extracting sentences from paragraphs

Question

I am new to Python and could use some help.

This is just a sample:

I have a list of dictionaries that all share the same keys:

list_dummy = [{'a': 1, 'b':"The house is great. I loved it.",'e':"loved,the"}, {'a': 3, 'b': "Building is white in colour. I liked it.",'e':"colour"}, {'a': 5, 'b': "She is looking pretty. She is in my college",'e':"pretty"}]

'b' holds the body text; 'e' holds one or more comma-separated words.

I want to extract the sentences from 'b' that contain one or more of the words in 'e'.

I first need to split the text into sentences with sent_tokenize and then extract the matching ones. sent_tokenize only takes a string as input. How should I proceed?

Brandon Hadfield · Answer 1 · 2017-10-02T20:21:25.843


I can't get the nltk module working to test this, but as long as sent_tokenize() returns a list of sentence strings, something like this should do what you're hoping for (if I understood correctly):

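from nltk.tokenize import sent_tokenize

ans = []
for d in list_dummy:
    # Split the body text into a list of sentence strings.
    tmp = sent_tokenize(d['b'])
    # Keep sentences containing any word from 'e' (case-insensitive substring test).
    s = [x for x in tmp if any(w.upper() in x.upper() for w in d['e'].split(","))]
    ans += s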

This assumes that 'e' is always a comma-separated string and that you want a case-insensitive search. The ans variable will be a flat list of the sentences that contain a word from the 'e' value of their own dictionary.
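A flat list loses track of which dictionary each sentence came from. As a minimal sketch of a grouped variant, the snippet below keeps one inner list per dictionary. The split_sentences helper is a hypothetical, much cruder stand-in for sent_tokenize, used only so the example runs without NLTK installed.

```python
import re

list_dummy = [
    {'a': 1, 'b': "The house is great. I loved it.", 'e': "loved,the"},
    {'a': 3, 'b': "Building is white in colour. I liked it.", 'e': "colour"},
    {'a': 5, 'b': "She is looking pretty. She is in my college", 'e': "pretty"},
]

def split_sentences(text):
    # Hypothetical stand-in for nltk's sent_tokenize: split on whitespace
    # that follows ".", "!" or "?". A real tokenizer handles far more cases.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

grouped = []
for d in list_dummy:
    # Upper-case the words once per dictionary instead of once per sentence.
    words = [w.strip().upper() for w in d['e'].split(",") if w.strip()]
    sentences = split_sentences(d['b'])
    # append (not +=) keeps one inner list per dictionary, in input order.
    grouped.append([s for s in sentences if any(w in s.upper() for w in words)])

print(grouped)
```

Because grouped[i] always belongs to list_dummy[i], each inner list can be written to its own row when exporting.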

EDIT

If you prefer using regular expressions you could use the re module:

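import re
from nltk.tokenize import sent_tokenize

ans = []
for d in list_dummy:
    b = sent_tokenize(d['b'])
    e = d['e'].split(",")
    # Escape each word so it is matched literally, then join as alternatives.
    rstring = "|".join(re.escape(w) for w in e)
    # IGNORECASE mirrors the case-insensitive search in the first version.
    r = re.compile(rstring, re.IGNORECASE)
    # search() looks anywhere in the sentence; match() only tries the start.
    ans.append([x for x in b if r.search(x)])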
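One variation that may be worth considering is matching whole words only, since a plain substring or alternation test also finds "the" inside "Another". The sketch below is an illustration under that assumption, not part of the tested answer; build_pattern is a hypothetical helper name, and the sentences are hard-coded so it runs without NLTK.

```python
import re

def build_pattern(words):
    # re.escape stops characters such as "." or "+" inside a word from being
    # read as regex syntax. The (?:...) group makes both \b anchors apply to
    # every alternative rather than only the first and last one.
    alternatives = "|".join(re.escape(w.strip()) for w in words if w.strip())
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)

pattern = build_pattern("loved,the".split(","))
sentences = ["The house is great.", "I loved it.", "Another one."]

# search() scans the whole sentence for a match.
matches = [s for s in sentences if pattern.search(s)]
print(matches)
```

Note that an empty word list would produce a pattern that matches at any word boundary, so it is worth guarding against that case before calling the helper.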

edited Oct 02 '17 at 20:21

answered Oct 02 '17 at 18:03


The code does not work. If I print(ans), it gives me: ['The house is great.', 'I loved it.', 'Building is white in colour.', 'I liked it.', 'She is looking pretty.', 'She is in my college']. It just gives me back all the sentences. – Deepti Oct 02 '17 at 18:22
Hi Deepti, I think I made a mistake when I originally posted. Does the edited code fix the issue? – Brandon Hadfield Oct 02 '17 at 18:31
Yes, now it gives me only those sentences. But how can I split the sentences and then get the extracted sentences for each element of the list? I need to export them to Excel and do some manual sentiment tagging. For example, I need the extracted sentence or sentences for the first element of the list in the first row, and those for the second element in the second row. With the above code, the positions shift if an element has more than one matching sentence. – Deepti Oct 02 '17 at 18:46
If I run only the code below: ans = [] for d in list_dummy: tmp = sent_tokenize(d['b']) ans += s print(tmp) (contd.) – Deepti Oct 02 '17 at 18:53
(contd.) The output is: ['The house is great.', 'I loved it.'] ['Building is white in colour.', 'I liked it.'] ['She is looking pretty.', 'She is in my college'], but 'tmp' only holds the last one. How can I combine them all, each corresponding to its own element? – Deepti Oct 02 '17 at 18:56
I'm afraid I don't fully understand what you're trying to achieve. If you need to keep track of which dictionary each sentence comes from, then changing `ans += s` to `ans.append(s)` should make `ans` a list of lists, each corresponding to a dictionary in your original `list_dummy` variable. If you need to track specifically which word each sentence contains, that shouldn't be a big deal; it would just take more formatting. – Brandon Hadfield Oct 02 '17 at 18:58
For sure, Deepti. Did I answer your question? Also, the upper calls are there so that the search is case insensitive. If you don't want "The" to match "the", you can take them out. – Brandon Hadfield Oct 02 '17 at 19:08
Any idea on how to extract the sentences through regular expression? – Deepti Oct 02 '17 at 19:56
Hey Deepti, take a look at the answer, I think that should work if you want to use regular expressions – Brandon Hadfield Oct 02 '17 at 20:21
It works for the sample data, but what if 'e' is a list? The split function does not work in that case. I also have to apply the regular expression to Danish terms, which contain different characters. – Deepti Oct 03 '17 at 15:10
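The last comment was left open in the thread. As a hedged sketch of one possible approach: the words can be normalised so that 'e' may be either a list or a comma-separated string, and in Python 3 a pattern built from str is Unicode-aware by default, so re.IGNORECASE also covers letters such as æ, ø and å. The helper names and the Danish sample sentences below are made up for illustration, and the sentences are assumed to be split already.

```python
import re

def normalise_words(e):
    # Accept either a comma-separated string or a list of words.
    if isinstance(e, str):
        e = e.split(",")
    return [w.strip() for w in e if w.strip()]

def matching_sentences(sentences, e):
    words = normalise_words(e)
    if not words:
        # No words to look for, so nothing can match.
        return []
    # Escape each word so it is matched literally, then join as alternatives.
    pattern = re.compile("|".join(re.escape(w) for w in words), re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]

# Made-up Danish sample data, already split into sentences.
sentences = ["Huset er grønt.", "Jeg kan lide det."]

from_list = matching_sentences(sentences, ["GRØNT", "blå"])
from_string = matching_sentences(sentences, "grønt, blå")
print(from_list, from_string)
```

If the text and the search words might use different Unicode forms (precomposed versus combining characters), passing both through unicodedata.normalize first is one way to make them comparable.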
