I'm trying to extract top words by date as follows:

df.set_index('Publishing_Date').Quotes.str.lower().str.extractall(r'(\w+)')[0].groupby('Publishing_Date').value_counts().groupby('Publishing_Date')

in the following dataframe:

import pandas as pd 

# initialize 
data = [['20/05', "So many books, so little time." ], ['20/05', "The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid." ], ['19/05', 
"Don't be pushed around by the fears in your mind. Be led by the dreams in your heart."], ['19/05', "Be the reason someone smiles. Be the reason someone feels loved and believes in the goodness in people."], ['19/05', "Do what is right, not what is easy nor what is popular."]] 

# Create the pandas DataFrame 
df = pd.DataFrame(data, columns = ['Publishing_Date', 'Quotes']) 

As you can see, there are many stop words ("the", "an", "a", "be", ...) that I would like to remove in order to get a better selection. My aim is to find key words, i.e. patterns, that quotes share by date, so I am more interested in nouns than in verbs.

Any idea how I could remove stop words AND keep only nouns?

Edit

Expected output (based on the results from Vaibhav Khandelwal's answer below):

Publishing_Date    Quotes    Nouns
20/05              ....      books, time, person, gentleman, lady, novel
19/05              ....      fears, mind, dreams, heart, reason, smiles

I need to extract only the nouns, ordered by frequency (for example, "reason" occurs more often than the others, so it should come first).

I think nltk.pos_tag could be useful here, keeping only the tokens whose tag is 'NN'.

  • check NLP --NLTK – BENY Jun 06 '20 at 16:17
  • Does this answer your question? [How to remove stop words using nltk or python](https://stackoverflow.com/questions/5486337/how-to-remove-stop-words-using-nltk-or-python) – Chris Jun 06 '20 at 16:17
  • Doing `from nltk.corpus import stopwords` `stop_list = stopwords.words('english')` `df['Quotes'].apply(lambda x: [item for item in x if item not in stop_list])` I got this error: `TypeError: 'float' object is not iterable` –  Jun 06 '20 at 16:23
  • But I do not know how to keep only the nouns from Quotes –  Jun 06 '20 at 16:29
  • what is your expected output? is it just removing stopwords from `'Quotes'` column? can you also post the expected output in the question like you have posted the input – anky Jun 06 '20 at 17:15
  • I would also need to keep only nouns, not verbs, adjectives, pronouns,... @anky –  Jun 06 '20 at 17:16
  • I updated the question to include an expected output (it may not be exact, as I did it manually) –  Jun 06 '20 at 17:30

1 Answer

This is how you can remove stopwords from your text:

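import nltk
from nltk.corpus import stopwords

# The stop-word corpus must be available first: nltk.download('stopwords')
def remove_stopwords(text):
    stop_words = stopwords.words('english')
    fresh_text = []

    # Lowercase the text and keep only the words that are not stop words
    for i in text.lower().split():
        if i not in stop_words:
            fresh_text.append(i)

    return ' '.join(fresh_text)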

df['text'] = df['Quotes'].apply(remove_stopwords)

NOTE: If you want to remove additional words, append them explicitly to the stopwords list.

[Image: output of the above code]
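The note about appending extra words can be sketched as follows. This is only an illustration under stated assumptions: the function takes the stop-word list as a parameter (the answer's version builds it inside the function), `base_stop_words` is a tiny hand-written stand-in for `stopwords.words('english')` so the snippet runs without NLTK data, and `extra_stop_words` is a hypothetical set of additions.

```python
def remove_stopwords(text, stop_words):
    """Lowercase the text and drop every word found in stop_words."""
    kept = []
    for word in text.lower().split():
        if word not in stop_words:
            kept.append(word)
    return ' '.join(kept)

# Stand-in for stopwords.words('english') (assumption: shortened by hand
# so that this sketch does not need the NLTK corpus).
base_stop_words = ['so', 'the', 'a', 'be', 'in']

# Hypothetical extra words to remove, chosen only for illustration.
extra_stop_words = ['many', 'little']

# Appending the extra words to the list, as the note suggests.
stop_words = base_stop_words + extra_stop_words

cleaned = remove_stopwords("So many books, so little time.", stop_words)
print(cleaned)  # books, time.
```

Punctuation stays attached to the words here because the text is split on whitespace only, exactly as in the answer's function; the later tokenization step is what separates it.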

For your other half you can add another function to extract nouns:

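# The NLTK tokenizer and POS-tagger data must be downloaded with nltk.download first
def extract_noun(text):
    token = nltk.tokenize.word_tokenize(text)
    result = []
    # Keep every token whose POS tag starts with 'NN' (NN, NNS, NNP, NNPS)
    for i in nltk.pos_tag(token):
        if i[1].startswith('NN'):
            result.append(i[0])

    return ', '.join(result)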

df['NOUN'] = df['text'].apply(extract_noun)

The final output will be as follows:

[Image: the final output after the noun extraction]
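The question also asks for the nouns to be collected per date and ordered by frequency, which the NOUN column alone does not give. A minimal sketch of that last step, assuming the comma-separated format produced by `extract_noun`: the helper `nouns_by_date` and the sample `rows` are hypothetical, and the rows are hand-written stand-ins for (Publishing_Date, NOUN) pairs, not actual tagger output.

```python
from collections import Counter, defaultdict

def nouns_by_date(rows):
    """Map each date to its nouns, most frequent first."""
    counts = defaultdict(Counter)
    for date, noun_string in rows:
        # Split the comma-separated string and ignore empty pieces
        nouns = [n.strip() for n in noun_string.split(',') if n.strip()]
        counts[date].update(nouns)
    # most_common() sorts by count, ties keep first-seen order
    return {date: [noun for noun, _ in counter.most_common()]
            for date, counter in counts.items()}

# Hand-written sample pairs (assumption: illustrative values only).
rows = [
    ('20/05', 'books, time'),
    ('20/05', 'person, gentleman, lady'),
    ('19/05', 'fears, mind, dreams, heart'),
    ('19/05', 'reason, someone, reason, someone'),
]

result = nouns_by_date(rows)
for date, nouns in result.items():
    print(date, ', '.join(nouns))
```

With the DataFrame from the question, the same function could be fed with `zip(df['Publishing_Date'], df['NOUN'])` once the NOUN column has been created.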

  • Thank you @Vaibhav Khandelwal. Your solution only partially answers my question, but it is good and helpful, so I am upvoting it –  Jun 06 '20 at 17:31