0

I have a dataframe with a 'description' column containing details about the product. Each description in the column is a long paragraph, for example:

"This is a superb product. I so so loved this superb product that I wanna gift to all. This is like the quality and packaging. I like it very much"

How do I locate/extract the sentence which has the phrase "superb product", and place it in a new column?

So for this case, the expected output would be the sentences that contain "superb product".

I have tried this:

searched_words=['superb product','SUPERB PRODUCT']


print(df['description'].apply(lambda text: [sent for sent in sent_tokenize(text)
                           if any(True for w in word_tokenize(sent) 
                                     if stemmer.stem(w.lower()) in searched_words)]))

The output of this is not what I need, though it works if I put just one word in the `searched_words` list.

  • You need to look into regex; that is what you're talking about. Just a warning: this question does not meet the site's standards, so people will vote it down. – Joe Jan 31 '18 at 23:38
  • How long is your list of search words? – alvas Feb 01 '18 at 06:22
  • @alvas It will have approximately 22,000 records, each containing a paragraph of 300-8000 words – shivam negi Feb 01 '18 at 13:16
  • That's the number of rows in your data. How many search words are you planning to search for? You have to consider big-O complexity (see https://rob-bell.net/2009/06/a-beginners-guide-to-big-o-notation/ and https://stackoverflow.com/questions/487258/what-is-a-plain-english-explanation-of-big-o-notation) if you're doing an unindexed search, as well as the order of your `searched_words`, especially if you use `any()` / `all()` – alvas Feb 02 '18 at 01:04
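
A note on the regex suggestion in the comments: the tokenizer-based attempt compares single word tokens against the search list, so a two-word phrase such as "superb product" can never equal any one token. A minimal sketch of the regex route follows, using only the standard library. The helper names `build_phrase_pattern` and `sentences_with_phrases` are hypothetical, the sentence splitter is deliberately naive (it breaks after `.`, `!` or `?` followed by whitespace), and the `\b` anchors assume each phrase starts and ends with a word character.

```python
import re

# Naive sentence splitter: break after ., ! or ? when followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def build_phrase_pattern(phrases):
    """Compile all search phrases into one case-insensitive alternation."""
    # Longest phrases first, so a longer phrase wins over its own prefix.
    ordered = sorted(phrases, key=len, reverse=True)
    alternation = "|".join(re.escape(p) for p in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)

def sentences_with_phrases(text, pattern):
    """Return every sentence of `text` in which `pattern` matches."""
    sentences = _SENTENCE_END.split(text.strip())
    return [s for s in sentences if pattern.search(s)]

description = (
    "This is a superb product. I so so loved this superb product that I "
    "wanna gift to all. This is like the quality and packaging. "
    "I like it very much"
)
pattern = build_phrase_pattern(["superb product"])
matches = sentences_with_phrases(description, pattern)
print(matches)
```

Compiling the phrases once into a single pattern means each sentence is scanned once rather than once per phrase, which may matter with a long search list. With pandas, the same function could presumably be applied per row, for example `df['description'].apply(lambda t: sentences_with_phrases(t, pattern))`.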

2 Answers

1

There are many ways to do this. @ChootsMagoots gave you a good answer, but spaCy is also efficient: you can define a pattern that leads you to the sentence. Before that, you need to define a function that sets the sentence boundaries. Here's the code:


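import spacy
from spacy.language import Language
from spacy.matcher import Matcher

# spaCy v3 style: custom components are registered under a name.
# (In spaCy v2, pass the function itself: nlp.add_pipe(product_sentencizer, before="parser").)
@Language.component("product_sentencizer")
def product_sentencizer(doc):
    ''' Look for sentence start tokens by scanning for periods only. '''
    for i, token in enumerate(doc[:-2]):  # The last token cannot start a sentence
        if token.text == ".":
            doc[i+1].is_sent_start = True
        else:
            doc[i+1].is_sent_start = False  # Tell the parser not to start a sentence at this token
    return doc

nlp = spacy.load('en_core_web_sm', disable=['ner'])
nlp.add_pipe("product_sentencizer", before="parser")  # Insert before the parser can build its own sentences
text = "This is a superb product. I so so loved this superb product that I wanna gift to all. This is like the quality and packaging. I like it very much."
doc = nlp(text)

matcher = Matcher(nlp.vocab)
# One dict per token: a single token can never contain the space in "superb product".
# LOWER compares against the lowercased token text, so the match is case-insensitive.
pattern = [{'LOWER': 'superb'}, {'LOWER': 'product'}]
matcher.add("SUPERB_PRODUCT", [pattern])  # The pattern must be registered before matching

matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)   # The matched phrase
    print(matched_span.sent)   # The sentence that contains it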

Ayoubayjx
0

Assuming the paragraphs are neatly formatted into sentences with ending periods, something like:

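for index, paragraph in df['column_name'].items():  # .iteritems() was removed in pandas 2.0
    for sentence in paragraph.split('.'):
        if 'superb prod' in sentence:
            print(sentence)
            # .loc avoids chained assignment, which may silently fail to write
            df.loc[index, 'extracted_sentence'] = sentence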

This is going to be quite slow, but I don't know if there's a better way.

ChootsMagoots
  • Hey, thanks for the quick reply. That worked for me, though it has not created a new column linking the sentences to the paragraph. Can you please tell me how I can make sure that each extracted sentence corresponds to its paragraph description without losing the index? I may need 3 columns: ID, DESCRIPTION, SentencesExtracted – shivam negi Jan 31 '18 at 23:56
  • See my edit, that should do it. Please confirm my answer :) – ChootsMagoots Feb 01 '18 at 00:08
  • Oh, the problem with this is it will only save the last sentence in the new column. Not sure what you want to do in the case where there are multiple sentences containing 'superb product' – ChootsMagoots Feb 01 '18 at 00:09
  • You also might need to declare df['extracted_sentence'] as a column before the loop, not sure. Sorry for a bunch of comments – ChootsMagoots Feb 01 '18 at 00:14
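
One possible way around the "only the last sentence is saved" problem raised in these comments is to collect every matching sentence into a list per row. The sketch below uses plain Python records so it stays dependency-free. The function name `extract_rows` and the sample records are hypothetical, the column names follow the ID / DESCRIPTION / SentencesExtracted layout requested above, and it keeps the same period-based split, so it assumes sentences end with periods.

```python
def extract_rows(records, phrase):
    """Return one output row per input record, keeping every matching sentence."""
    needle = phrase.lower()
    rows = []
    for record in records:
        # Period-based split; empty fragments (e.g. after the final period) are dropped.
        sentences = [s.strip() for s in record["DESCRIPTION"].split(".") if s.strip()]
        # Case-insensitive substring test, so "SUPERB PRODUCT" also matches.
        hits = [s for s in sentences if needle in s.lower()]
        rows.append({
            "ID": record["ID"],
            "DESCRIPTION": record["DESCRIPTION"],
            "SentencesExtracted": hits,
        })
    return rows

# Hypothetical sample data for illustration only.
records = [
    {"ID": 1, "DESCRIPTION": "This is a superb product. I so so loved this "
                             "superb product that I wanna gift to all. "
                             "I like it very much."},
    {"ID": 2, "DESCRIPTION": "Average quality. Nothing special."},
]
rows = extract_rows(records, "superb product")
for row in rows:
    print(row["ID"], row["SentencesExtracted"])
```

Because each row keeps its own ID and description, the extracted sentences stay aligned with their paragraph. In pandas, the same idea could be written as a per-row function assigned with `df['SentencesExtracted'] = df['description'].apply(...)`, which preserves the index.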