Mapping pandas dataframe column to a dictionary

Question

I have a dataframe containing a categorical variable of high cardinality (many unique values). I would like to recode that variable so that the most frequent values are kept and all other values are replaced with a catch-all category ("other"). To give a simple example:

Here are the two values which should stay unchanged:

top_values = ['apple', 'orange']

I established them based on their frequency in the following dataframe column:

{'fruits': {0: 'apple',
1: 'apple',
2: 'orange',
3: 'orange',
4: 'banana',
5: 'grape'}}

That dataframe column should be re-coded as follows:

{'fruits': {0: 'apple',
1: 'apple',
2: 'orange',
3: 'orange',
4: 'other',
5: 'other'}}

How can I do that? (The dataframe has millions of records.)

score 7 · Accepted Answer · answered Nov 07 '18 at 18:37

There are at least a couple of methods you can use:

`where` + Boolean indexing

df['fruits'] = df['fruits'].where(df['fruits'].isin(top_values), 'other')

`loc` + Boolean indexing

df.loc[~df['fruits'].isin(top_values), 'fruits'] = 'other'

After this process, you will probably want to turn your series into a categorical:

df['fruits'] = df['fruits'].astype('category')

Doing this before the value replacement operation probably won't help as your input series has high cardinality.
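
As a minimal end-to-end sketch of the steps above, the snippet below rebuilds the sample data from the question, derives `top_values` by frequency, recodes with `where`, and then converts to a categorical. The variable `n_top` and the use of `value_counts().nlargest()` to pick the top values are assumptions for illustration; the question does not say how `top_values` was obtained.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame(
    {'fruits': ['apple', 'apple', 'orange', 'orange', 'banana', 'grape']}
)

# Assumption: "top" means the n_top most frequent values (n_top is hypothetical).
n_top = 2
top_values = df['fruits'].value_counts().nlargest(n_top).index.tolist()

# Keep values found in top_values; replace everything else with 'other'.
df['fruits'] = df['fruits'].where(df['fruits'].isin(top_values), 'other')

# Convert to a categorical only now, when the cardinality is low.
df['fruits'] = df['fruits'].astype('category')

print(df['fruits'].tolist())
# ['apple', 'apple', 'orange', 'orange', 'other', 'other']
```

Assigning the result back to the column, rather than relying on `inplace=True`, keeps the behaviour the same regardless of whether `df['fruits']` returns a view or a copy.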

jpp

It occurs to me that the `where` code snippet lacks a negation: it would replace the values that match the pattern, rather than those that don't. – Nick Nov 10 '18 at 15:09
@Nick, Yep it's deceptive (vs e.g. `np.where`). Use `pd.Series.mask` to change values matching a condition; use `pd.Series.where` to change values *not* matching a condition. The lack of intuition is probably why it has never caught on. – jpp Nov 10 '18 at 15:22
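
To make the `where` / `mask` distinction from the comments concrete, here is a small sketch on a toy series. The names `s`, `in_top`, `kept_with_where` and `kept_with_mask` are hypothetical and chosen only for this illustration.

```python
import pandas as pd

s = pd.Series(['apple', 'orange', 'banana', 'grape'])
top_values = ['apple', 'orange']
in_top = s.isin(top_values)

# where: keep entries where the condition is True, replace the rest.
kept_with_where = s.where(in_top, 'other')

# mask: replace entries where the condition is True, so the condition is negated.
kept_with_mask = s.mask(~in_top, 'other')

print(kept_with_where.tolist())  # ['apple', 'orange', 'other', 'other']
print(kept_with_where.equals(kept_with_mask))  # True
```

Both calls give the same result here, so the original `where` snippet needs no negation.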

score 1 · Answer 2 · answered Nov 07 '18 at 18:41

df['newCol'] = df['fruits'].apply(lambda x: x if x in top_values else 'other')

Venkatachalam
