I have a pandas DataFrame with three columns: key1, key2, and document. All three are text fields; document ranges from 50 to 5000 characters. For each (key1, key2) group I build a vocabulary of the words that meet a minimum document frequency, using scikit-learn's CountVectorizer with min_df set. I am able to do this with df.groupby(['key1','key2'])['document'].apply(vocab).reset_index(), where vocab is a function that computes and returns the vocabulary (as defined above) as a set.

Now I would like to use these vocabularies (one set per (key1, key2) pair) to filter the corresponding documents, so that each document keeps only the words that are in its vocabulary. I would appreciate any help with this part.

Sample data

Input

key1 | key2 | document
 aa  | bb   | He went home that evening. Then he had soup for dinner.
 aa  | bb   | We want to sit down and eat dinner
 cc  | mm   | Sometimes people eat in a restaurant
 aa  | bb   | The culinary skills of that chef are terrible.  Let us not go there.
 cc  | mm   | People go home after dinner and try to sleep.


Vocabulary - not using counts for the purpose of this example

key1 | key2 | vocab
 aa  | bb   | {went, evening, sit, down, culinary, chef, dinner}
 cc  | mm   | {people, restaurant, home, dinner, sleep}

Result - only use words from corresponding vocab in document

key1 | key2 | document
 aa  | bb   | went evening dinner
 aa  | bb   | sit down dinner
 cc  | mm   | people restaurant
 aa  | bb   | culinary chef
 cc  | mm   | people home dinner sleep
Joe
ironv

1 Answer

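You can first use merge to add the vocab column to the original DataFrame:

import re
import pandas as pd

# one row per (key1, key2) holding that group's vocabulary set
df_vocab = df.groupby(['key1','key2'])['document'].apply(vocab).reset_index(name='vocab')
# attach each group's vocabulary to every document in that group
df = pd.merge(df, df_vocab, on=['key1','key2'], how='left')

#another theoretical solution
#df['vocab'] = df.groupby(['key1','key2'])['document'].transform(vocab)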
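
The question does not show vocab itself. For experimenting with the steps here, a minimal stand-in can be sketched in plain Python. This is an assumption, not the asker's function: it counts document frequency with collections.Counter instead of CountVectorizer, and unlike CountVectorizer's default token pattern it also keeps one-character tokens. The default min_df=2 is chosen only for illustration.

```python
import re
from collections import Counter

def vocab(documents, min_df=2):
    """Hypothetical stand-in: lowercase words found in at least min_df documents."""
    doc_freq = Counter()
    for text in documents:
        # set() counts each word once per document (document frequency)
        doc_freq.update(set(re.findall(r'\w+', text.lower())))
    return {word for word, n in doc_freq.items() if n >= min_df}
```

Because it only iterates over its argument, it accepts a list of strings as well as the per-group Series that groupby passes in.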

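Then lowercase the text and extract all words with str.findall. The lowercasing is needed so that capitalized words such as People match the lowercase vocabulary; re.I only makes the pattern case-insensitive and does not change the extracted text, so it is dropped here:

df['document'] = df['document'].str.lower().str.findall(r'\w+')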

Finally, take the intersection of each document's word set with its vocabulary, convert the result to a string with str.join, and drop the helper column vocab:

df['document'] = df.apply(lambda x: set(x['document']) & x['vocab'], axis=1).str.join(' ')
df = df.drop('vocab', axis=1)
print (df)
  key1 key2                  document
0   aa   bb       evening went dinner
1   aa   bb           sit down dinner
2   cc   mm         restaurant people
3   aa   bb             chef culinary
4   cc   mm  home people sleep dinner
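
The set intersection returns words in arbitrary order and collapses repeated words, while the expected result in the question keeps the original word order. If that matters, a list comprehension can replace the intersection. The following is a sketch under assumptions: the vocab column has already been merged in as above, the two sample rows are taken from the question, and keep_vocab_words is a hypothetical helper name.

```python
import re
import pandas as pd

# two sample rows from the question, with the vocab column already attached
df = pd.DataFrame({
    'key1': ['aa', 'cc'],
    'key2': ['bb', 'mm'],
    'document': ['He went home that evening. Then he had soup for dinner.',
                 'People go home after dinner and try to sleep.'],
    'vocab': [{'went', 'evening', 'sit', 'down', 'culinary', 'chef', 'dinner'},
              {'people', 'restaurant', 'home', 'dinner', 'sleep'}],
})

def keep_vocab_words(row):
    # lowercase so that capitalized words can match the lowercase vocabulary
    words = re.findall(r'\w+', row['document'].lower())
    # iterating over the word list (not a set) preserves the original order
    return ' '.join(w for w in words if w in row['vocab'])

df['document'] = df.apply(keep_vocab_words, axis=1)
df = df.drop(columns='vocab')
print(df['document'].tolist())
# ['went evening dinner', 'people home dinner sleep']
```

Repeated words are kept as often as they occur in the document; wrap the word list in dict.fromkeys if each word should appear only once.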
jezrael
  • Jezrael, can you check this one https://stackoverflow.com/questions/48396321/how-to-merge-two-dataframes-based-on-a-column-in-pandas – Pyd Jan 23 '18 at 07:33