Find a value of a dictionary in dataframe column and modify it

Question

I dealing with DataFrames and Dictionaries now, and i have a problem, I have a Dictionary "Fruits"

{BN:'Banana', LM:'Lemon', AP:'Apple' ..... etc}

And a DataFrame- "Stock":

   Fruit             Price
0  Sweet Mango           1
1  Green Apple           2
2  Few blue Banana       0
3  Black Banana          5

I wand to do the next thing: replace all the values from Stock['Fruit'] with the Fruits.values() this way: if the value from Fruits appears in the Stock['Fruit'] row it will be replaced this way:

Few blue Banana ---> Banana

Black Banana ---> Banana

now the DataFrame Stock will look this way:

   Fruit             Price
0  Sweet Mango           1
1  Green Apple           2
2  Banana                0
3  Banana                5

I found different codes to replace or to check if values from the Dicitionary appears in the DataFrame

Stock['Fruit'] = Stock.Fruit.map(Fruits)

if (Fruits.values() in Stock['Fruit'] for item in Stock)

any('Mango' in Stock['Fruit'] for index,item in Stock.iterrows())

But i cant find any thing to update the rows of the DataFrame

Yes, and I'm asking why you don't change `Green Apple` to `Apple`, but you *do* change `Black Banana` to `Banana` in your output above — user3483203, Oct 03 '18 at 21:56
Does my answer include your desired output? If not, can you clarify what your complete desired output is? — rahlf23, Oct 03 '18 at 21:57

score 2 · Answer 1 · answered Oct 03 '18 at 21:54

2

IIUC, you can use apply() with a custom function:

import pandas as pd

df = pd.DataFrame([['Sweet Mango', 1],['Green Apple', 2],['Few blue Banana', 0],['Black Banana', 5]],
  columns=['Fruit','Price'])

fruits = {'BN':'Banana', 'LM': 'Lemon', 'AP':'Apple', 'MG': 'Mango'}

def find_category(x):

  return [k for k in fruits.values() if k in x][0]

df['Fruit'] = df['Fruit'].apply(find_category)

Yields:

    Fruit  Price
0   Mango      1
1   Apple      2
2  Banana      0
3  Banana      5

answered Oct 03 '18 at 21:54

rahlf23

8,869
4
24
54

1

‘apply()’ will be most efficient if dealing with larger dataframes. – JLuxton Oct 03 '18 at 22:08
@rahlf23 tnx, it works for the dataframe and dictionary i've gave as example, now i've tried to do this with another one, and there is a problem: the new dictionary `states = {'OH': 'Ohio', 'KY': 'Kentucky', 'AS': 'American Samoa', 'NV': 'Nevada'...} ` and a dataframe with columns ` State and Region` the state also contains few words that i want to replace with one word. when i copied your code and only changed the variables it gives me a mistake: `----> 5 return [k for k in states.values() if k in x][0] IndexError: list index out of range` when i deleted the `[0]` – B. Kristina Oct 03 '18 at 22:08
@rahlf23 when i deleted the `[0]` it gives me a "good" dataframe but with the next problems: ` State RegionName 0 [] Auburn 1 [Alabama] Florence 2 [] Jacksonville 3 [Alabama] Livingston 4 [] Montevallo 5 [] Troy 6 [Alabama] Tuscaloosa 7 [] Tuskegee 8 [Alaska] Fairbanks 9 [Arizona] Flagstaff` as you can see it skips some of the values and added "[]" to all the others – B. Kristina Oct 03 '18 at 22:10
@JLuxton the one that i use is much bigger, i just practised with small one =) – B. Kristina Oct 03 '18 at 22:11
@JLuxton `apply` will almost certainly be slower for larger dataframes than other options. *Especially* since this is an O(n*k) solution. You'd have better luck with a list comprehension, even `[find_category(x) for x in df.Fruit]` would be vastly more performant, but the algorithm itself will bottleneck performance at that point – user3483203 Oct 03 '18 at 22:19
1

@B.Kristina This solution will return `[]` if there are no exact matches in the values of your dictionary. The reason for the `[0]` is to return the first match (as I am assuming that your values will be unique and not be valid for multiple keys). If you remove `[0]`, then it will return the list of keys that contained that value. – rahlf23 Oct 03 '18 at 22:44

Vaishali · Accepted Answer · 2018-10-03T22:00:42.210

2

Use string methods for condition and extracting required values,

pat = r'({})'.format('|'.join(d.values()))
cond = df['Fruit'].str.contains('|'.join(d.values()))
df.loc[cond, 'Fruit'] = df['Fruit'].str.extract((pat), expand = False)

    Fruit       Price
0   Sweet Mango 1
1   Apple       2
2   Banana      0
3   Banana      5

Edit: As @user3483203 suggested, you can fill the missing values with original once the pattern is extracted.

df['Fruit'] = df['Fruit'].str.extract(pat).fillna(df.Fruit)

edited Oct 03 '18 at 22:00

answered Oct 03 '18 at 21:56

Vaishali

37,545
5
58
86

1

Or just `df.Fruit.str.extract(pat).fillna(df.Fruit)`. You also don't need to specify the `expand=False`, because you only have a single capture group – user3483203 Oct 03 '18 at 21:57

PMende · Answer 3 · 2018-10-03T23:01:16.663

Using the results of the answer here, we create a new class that subclasses defaultdict, and override its __missing__ attribute to allow for passing the key to the default_factory:

from collections import defaultdict
class keydefaultdict(defaultdict):
    def __missing__(self, key):
        if self.default_factory is None:
            raise KeyError(key)
        else:
            ret = self[key] = self.default_factory(key)
            return ret

We create an initial dictionary that maps the 2 values in the 'Fruits' column we want to replace.

fruit_dict = {'Few blue Banana': 'Banana', 'Black Banana': 'Banana'}

Then we create a new instance of our class with a default_factory of lambda x: x. I.e., if we don't find the key when we search for it, put the key in as the value.

fruit_col_map = keydefaultdict(lambda x: x)
fruit_col_map.update(**fruit_dict)

Finally, update the column:

df['Fruit'] = df['Fruit'].map(fruit_col_map)
df

Output:

         Fruit  Price
0  Sweet Mango      1
1  Green Apple      2
2       Banana      0
3       Banana      5

Comparing to the accepted answer, this is more than 6 times faster:

df = pd.DataFrame({
    'Fruit': ['Sweet Mango', 'Green Apple', 'Few blue Banana', 'Black Banana']*1000,
    'Price': [1, 2, 0, 5]*1000
})
%timeit df['Fruit'].map(fruit_col_map)

Results:

1.03 ms ± 48.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Accepted answer:

pat = r'({})'.format('|'.join(fruit_dict.values()))
%timeit df['Fruit'].str.extract(pat).fillna(df['Fruit'])

Results:

6.85 ms ± 223 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Find a value of a dictionary in dataframe column and modify it

3 Answers3

Linked