
I have a column in my first data frame, df1["ItemType1"], as below:

Dataframe1

ItemType1
redTomato
whitePotato
yellowPotato
greenCauliflower
yellowCauliflower
yelloSquash
redOnions
YellowOnions
WhiteOnions
yellowCabbage
GreenCabbage

I need to replace those values based on a dictionary created from another data frame.

Dataframe2

ItemType2            newType
whitePotato          Potato
yellowPotato         Potato
redTomato            Tomato
yellowCabbage
GreenCabbage
yellowCauliflower    yellowCauliflower
greenCauliflower     greenCauliflower
YellowOnions         Onions
WhiteOnions          Onions
yelloSquash          Squash
redOnions            Onions

Notice that:

  • In dataframe2 some of the ItemType values are the same as the ItemType values in dataframe1.
  • Some ItemType entries in dataframe2 have empty/null newType values, like yellowCabbage.
  • The ItemType values in dataframe2 are out of order with respect to ItemType in dataframe1.

I need to replace the values in the Dataframe1 ItemType column with the corresponding newType wherever there is a match in the Dataframe2 ItemType column, keeping the exceptions listed in the bullet points above in mind.
If there is no match, the values need to stay as they are (no change).
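
For reference, the two sample frames shown above can be constructed directly like this (a minimal sketch; the Dataframe1/Dataframe2 names and the NaN entries for the blank newType cells are assumptions based on the data shown):

import numpy as np
import pandas as pd

Dataframe1 = pd.DataFrame({'ItemType1': [
    'redTomato', 'whitePotato', 'yellowPotato', 'greenCauliflower',
    'yellowCauliflower', 'yelloSquash', 'redOnions', 'YellowOnions',
    'WhiteOnions', 'yellowCabbage', 'GreenCabbage']})

# NaN marks the empty newType cells (yellowCabbage, GreenCabbage).
Dataframe2 = pd.DataFrame({
    'ItemType2': ['whitePotato', 'yellowPotato', 'redTomato', 'yellowCabbage',
                  'GreenCabbage', 'yellowCauliflower', 'greenCauliflower',
                  'YellowOnions', 'WhiteOnions', 'yelloSquash', 'redOnions'],
    'newType':   ['Potato', 'Potato', 'Tomato', np.nan,
                  np.nan, 'yellowCauliflower', 'greenCauliflower',
                  'Onions', 'Onions', 'Squash', 'Onions']})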

So far, this is what I have:

import pandas as pd

# read the second `csv-file`
df2 = pd.read_csv('mappings.csv', names=["ItemType", "newType"])
# convert to dict
df2 = df2.set_index('ItemType').T.to_dict('list')

The replace-on-match attempts below are not working; they insert NaN values instead of the actual ones. They are based on discussions here on SO.

df1.loc[df1['ItemType'].isin(df2['ItemType'])]=df2[['NewType']]

OR

df1['ItemType']=df2['ItemType'].map(df2)

Thanks in advance

EDIT
The column headers in the two data frames have different names: the column in dataframe1 is ItemType1 and the first column in the second data frame is ItemType2. I missed that in the first edit.

Anil_M

3 Answers


Use map

All the logic you need:

def update_type(t1, t2, dropna=False):
    # map t1 through t2; drop unmatched rows if dropna, otherwise keep the original values
    return t1.map(t2).dropna() if dropna else t1.map(t2).fillna(t1)

Let's make 'ItemType2' the index of Dataframe2

update_type(Dataframe1.ItemType1,
            Dataframe2.set_index('ItemType2').newType)

0                Tomato
1                Potato
2                Potato
3      greenCauliflower
4     yellowCauliflower
5                Squash
6                Onions
7                Onions
8                Onions
9         yellowCabbage
10         GreenCabbage
Name: ItemType1, dtype: object

update_type(Dataframe1.ItemType1,
            Dataframe2.set_index('ItemType2').newType,
            dropna=True)

0                Tomato
1                Potato
2                Potato
3      greenCauliflower
4     yellowCauliflower
5                Squash
6                Onions
7                Onions
8                Onions
Name: ItemType1, dtype: object

Verify

updated = update_type(Dataframe1.ItemType1, Dataframe2.set_index('ItemType2').newType)

pd.concat([Dataframe1, updated], axis=1, keys=['old', 'new'])

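A sketch of what that comparison produces, reconstructed from the sample data above:

                  old                new
            ItemType1          ItemType1
0           redTomato             Tomato
1         whitePotato             Potato
2        yellowPotato             Potato
3    greenCauliflower   greenCauliflower
4   yellowCauliflower  yellowCauliflower
5         yelloSquash             Squash
6           redOnions             Onions
7        YellowOnions             Onions
8         WhiteOnions             Onions
9       yellowCabbage      yellowCabbage
10       GreenCabbage       GreenCabbage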


Timing

def root(Dataframe1, Dataframe2):
    return Dataframe1['ItemType1'].replace(Dataframe2.set_index('ItemType2')['newType'].dropna())

def piRSquared(Dataframe1, Dataframe2):
    t1 = Dataframe1.ItemType1
    t2 = Dataframe2.set_index('ItemType2').newType
    return update_type(t1, t2)

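A minimal sketch of how the two functions could be compared, using the standard timeit module and assuming Dataframe1 and Dataframe2 are built from the sample data above:

from timeit import timeit

# rough comparison on the small sample frames
for func in (root, piRSquared):
    elapsed = timeit(lambda: func(Dataframe1, Dataframe2), number=1000)
    print(func.__name__, elapsed)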

piRSquared
  • Hooray for time measurements... +1 – mechanical_meat Jul 19 '16 at 19:54
  • Hi - thanks for the response, still checking it out. The solution also needs to omit/drop `ItemType1` items for which the corresponding values are `empty`/`null` in the second dataframe's `newType` column. So here it should drop `yellowCabbage` and `GreenCabbage` so that they don't appear in the final table. – Anil_M Jul 19 '16 at 20:34
  • @Anil_M that's easy, I forcibly put that back in. Updated the post with an optional dropna parameter. – piRSquared Jul 19 '16 at 20:39
  • @piRSquared - I am getting `NameError: global name 'dropna' is not defined` for `return t1.map(t2).dropna() if dropna else t1.map(t2).fillna(t1)` – Anil_M Jul 19 '16 at 20:57
  • @Anil_M did you put the `dropna=False` in the signature of the function definition? – piRSquared Jul 19 '16 at 21:01
  • Yes, that fixed it. – Anil_M Jul 20 '16 at 19:02

You can convert df2 into a Series indexed by 'ItemType2', and then use replace on df1:

# Make df2 a Series indexed by 'ItemType2'.
df2 = df2.set_index('ItemType2')['newType'].dropna()

# Replace values in df1.
df1['ItemType1'] = df1['ItemType1'].replace(df2)

Or in a single line, if you don't want to alter df2:

df1['ItemType1'] = df1['ItemType1'].replace(df2.set_index('ItemType2')['newType'].dropna())
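
Because the rows with an empty newType are dropped from the mapping, and replace leaves values it does not find untouched, yellowCabbage and GreenCabbage stay as they are. With the sample data the result should be:

0                Tomato
1                Potato
2                Potato
3      greenCauliflower
4     yellowCauliflower
5                Squash
6                Onions
7                Onions
8                Onions
9         yellowCabbage
10         GreenCabbage
Name: ItemType1, dtype: object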
root
  • Two-layer question here. Firstly, when I try to run this now on one column, I get a MemoryError; what can be done about that? Second question: I am trying to use this in the work I am doing now, but I need something more complicated. I want to apply the one-column match to a huge dataframe with a bunch of columns (about 100) and rows. How would I modify the code to achieve that? – bernando_vialli Nov 03 '17 at 17:39

This method requires you to set your column names to 'type'; then you can use merge and np.where:

import numpy as np

df3 = df1.merge(df2, how='inner', on='type')[['type', 'newType']]

df3['newType'] = np.where(df3['newType'].isnull(), df3['type'], df3['newType'])
draco_alpine
  • Hi - thanks for the quick reply. I have made slight changes to the question. The column headers in the two data frames have different names, so the dataframe1 column is `ItemType1` and the first column in the second data frame is `ItemType2`. Also, the above solution is giving the error `KeyError: 'type'`. – Anil_M Jul 19 '16 at 19:27
  • The error on 'type' and the issues with ItemType1 and ItemType2 are one and the same. Specifically, I'm trying to join on 'type' when actually neither df has a column 'type'; instead they have 'ItemType1' and 'ItemType2'. Personally I would rename the columns to ItemType in both df's and proceed, but the other solutions offered may be more suitable to your specific needs. – draco_alpine Jul 19 '16 at 19:52
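
Building on that comment, a minimal sketch of the same merge/np.where idea adapted to the actual column names (the left join and the left_on/right_on keys are assumptions, not part of the original answer):

import numpy as np

# Join on the differently named key columns; a left join keeps rows with no match.
merged = df1.merge(df2, how='left', left_on='ItemType1', right_on='ItemType2')

# Fall back to the original value where newType is missing or there was no match.
df1['ItemType1'] = np.where(merged['newType'].isnull(),
                            merged['ItemType1'],
                            merged['newType'])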