I have a dataframe containing a categorical variable of high cardinality (many unique values). I would like to re-code that variable so that only the most frequent values are kept and all other values are replaced with a catch-all category ("other"). To give a simple example:
Here are the two values which should stay unchanged:
top_values = ['apple', 'orange']
I established them based on their frequency in the following dataframe column:
{'fruits': {0: 'apple',
1: 'apple',
2: 'orange',
3: 'orange',
4: 'banana',
5: 'grape'}}
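For reference, the frequency-based selection can be sketched roughly as below. This is a minimal sketch assuming pandas; `n_top` is a hypothetical name for the number of values to keep and is not part of my actual code.

```python
import pandas as pd

# Example data copied from the dataframe column shown above.
df = pd.DataFrame({'fruits': ['apple', 'apple', 'orange', 'orange', 'banana', 'grape']})

# Hypothetical parameter: how many of the most frequent values to keep.
n_top = 2

# value_counts() sorts by frequency in descending order, so the first
# n_top index labels are the most frequent values.
top_values = df['fruits'].value_counts().head(n_top).index.tolist()
```

If several values tie at the cut-off, which of them ends up in the list depends on how `value_counts()` orders ties, so the cut-off may need to be chosen with that in mind.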
That dataframe column should be re-coded as follows:
{'fruits': {0: 'apple',
1: 'apple',
2: 'orange',
3: 'orange',
4: 'other',
5: 'other'}}
How can I do this efficiently? The dataframe has millions of records.
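One way the re-coding could be sketched, assuming pandas and the `top_values` list from above, is a vectorized combination of `Series.isin` and `Series.where`. This is only an illustration of the desired result on the example data, not a benchmarked solution for millions of rows.

```python
import pandas as pd

# Example data and top values copied from the question.
df = pd.DataFrame({'fruits': ['apple', 'apple', 'orange', 'orange', 'banana', 'grape']})
top_values = ['apple', 'orange']

# isin() builds a boolean mask marking rows whose value is in top_values.
# where() keeps the rows where the mask is True and replaces the rest with 'other'.
df['fruits'] = df['fruits'].where(df['fruits'].isin(top_values), 'other')
```

Both calls operate on the whole column at once rather than row by row. Note that missing values (NaN) do not match `isin`, so in this sketch they would also become 'other'.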