I have a table of words with mixed categorical classifications. I want to make it so that the 'common type' column is of the highest occurring (mode) classification label such that each row has a label.
word type common type
post | WORK_OF_ART | WORK_OF_ART
post | WORK_OF_ART | WORK_OF_ART
post | WORK_OF_ART | WORK_OF_ART
post | WORK_OF_ART | WORK_OF_ART
post | WORK_OF_ART | WORK_OF_ART
post | OTHER | WORK_OF_ART
post | WORK_OF_ART | WORK_OF_ART
post | WORK_OF_ART | WORK_OF_ART
post | OTHER | WORK_OF_ART
-----|--------------------------
sign | OTHER | OTHER
sign | WORK_OF_ART | OTHER
sign | OTHER | OTHER
sign | WORK_OF_ART | OTHER
sign | OTHER | OTHER
sign | OTHER | OTHER
sign | WORK_OF_ART | OTHER
I the following function but on a dataframe of 1m+ rows the runtime is abysmal
def replace_most_common_type(frame, word):
common_type = frame[frame['word']==word]['type'].value_counts().idxmax()
frame.loc[frame['word']==word, 'type'] = common_type
unique_words = master_frame['word'].unique()
for idx, word in unique_words:
replace_most_common_type(master_frame, word)
built in pandas methods tend to be numpy vectorized so any solution using native pandas functions is appreciated