I'm trying to extract top words by date as follows:
df.set_index('Publishing_Date').Quotes.str.lower().str.extractall(r'(\w+)')[0].groupby('Publishing_Date').value_counts().groupby('Publishing_Date')
in the following dataframe:
import pandas as pd

# Sample data: one [date, quote] pair per row
data = [
    ['20/05', "So many books, so little time."],
    ['20/05', "The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid."],
    ['19/05', "Don't be pushed around by the fears in your mind. Be led by the dreams in your heart."],
    ['19/05', "Be the reason someone smiles. Be the reason someone feels loved and believes in the goodness in people."],
    ['19/05', "Do what is right, not what is easy nor what is popular."],
]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Publishing_Date', 'Quotes'])
As you can see, there are many stop words ("the", "an", "a", "be", ...) that I would like to remove in order to get a better selection. My aim is to find key words, i.e. patterns, in common by date, so I am more interested in nouns than in verbs.
Any idea how I could remove stop words AND keep only nouns?
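To make the stop-word part of the question concrete, here is a minimal sketch using only the standard library. The STOP_WORDS set and the helper name top_words_by_date are hypothetical and exist only for illustration; a fuller list such as NLTK's English stop-word corpus could presumably be passed in instead. It counts words per date after dropping anything in the stop list.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stop list (an assumption, not a standard list).
STOP_WORDS = {
    "a", "an", "the", "be", "in", "by", "so", "is", "it", "or", "not",
    "what", "who", "has", "your", "and", "nor", "do", "must",
}


def top_words_by_date(rows, stop_words=STOP_WORDS, n=3):
    """Map each date to its n most frequent words that are not stop words."""
    counts = defaultdict(Counter)
    for date, quote in rows:
        # Same tokenisation as the \w+ pattern used in the question.
        words = re.findall(r"\w+", quote.lower())
        counts[date].update(w for w in words if w not in stop_words)
    return {date: [w for w, _ in c.most_common(n)] for date, c in counts.items()}
```

With the DataFrame from the question, this could be called as top_words_by_date(zip(df.Publishing_Date, df.Quotes)). It only removes stop words; it does not distinguish nouns from verbs.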
Edit
Expected output (based on the results from Vaibhav Khandelwal's answer below):
Publishing_Date   Quotes   Nouns
20/05             ....     books, time, person, gentleman, lady, novel
19/05             ....     fears, mind, dreams, heart, reason, smiles
I need to extract only the nouns, ordered by frequency (for example, "reason" occurs more often than the others, so it should rank higher).
I think nltk.pos_tag should be useful here, keeping only the words whose tag is 'NN'.
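The idea of filtering on POS tags and ordering by frequency could be sketched roughly as follows. The function name nouns_by_frequency and its tagger parameter are hypothetical; by default it calls nltk.pos_tag, which assumes NLTK is installed and its tagger model has already been fetched with nltk.download (the resource name varies between NLTK versions). Note that the Penn Treebank tag set used by nltk.pos_tag marks plural nouns as 'NNS', so matching 'NN' alone would likely miss words such as "books" or "fears"; the sketch accepts both tags.

```python
import re
from collections import Counter


def nouns_by_frequency(text, tagger=None, noun_tags=("NN", "NNS")):
    """Return the nouns found in `text`, most frequent first."""
    if tagger is None:
        # Imported lazily so a stub tagger can be supplied without NLTK installed.
        import nltk
        tagger = nltk.pos_tag
    # Simple regex tokenisation (an assumption; nltk.word_tokenize is an alternative).
    tokens = re.findall(r"[A-Za-z']+", text)
    tagged = tagger(tokens)  # expected to return a list of (word, tag) pairs
    # Tag on the original casing, then lowercase before counting.
    counts = Counter(word.lower() for word, tag in tagged if tag in noun_tags)
    return [word for word, _ in counts.most_common()]
```

Applied per date, something like df.groupby('Publishing_Date')['Quotes'].agg(' '.join).apply(nouns_by_frequency) should give one frequency-ordered noun list per date. Tagger output depends on the model, so the exact nouns returned may differ from the expected output above.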