Replace column values based on another dataframe python pandas - better way?

Question

Note:for simplicity's sake, i'm using a toy example, because copy/pasting dataframes is difficult in stack overflow (please let me know if there's an easy way to do this).

Is there a way to merge the values from one dataframe onto another without getting the _X, _Y columns? I'd like the values on one column to replace all zero values of another column.

df1: 

Name   Nonprofit    Business    Education

X      1             1           0
Y      0             1           0   <- Y and Z have zero values for Nonprofit and Educ
Z      0             0           0
Y      0             1           0

df2:

Name   Nonprofit    Education
Y       1            1     <- this df has the correct values. 
Z       1            1



pd.merge(df1, df2, on='Name', how='outer')

Name   Nonprofit_X    Business    Education_X     Nonprofit_Y     Education_Y
Y       1                1          1                1               1
Y      1                 1          1                1               1
X      1                 1          0               nan             nan   
Z      1                 1          1                1               1

In a previous post, I tried combine_First and dropna(), but these don't do the job.

I want to replace zeros in df1 with the values in df2. Furthermore, I want all rows with the same Names to be changed according to df2.

Name    Nonprofit     Business    Education
Y        1             1           1
Y        1             1           1 
X        1             1           0
Z        1             0           1

(need to clarify: The value in 'Business' column where name = Z should 0.)

My existing solution does the following: I subset based on the names that exist in df2, and then replace those values with the correct value. However, I'd like a less hacky way to do this.

pubunis_df = df2
sdf = df1 

regex = str_to_regex(', '.join(pubunis_df.ORGS))

pubunis = searchnamesre(sdf, 'ORGS', regex)

sdf.ix[pubunis.index, ['Education', 'Public']] = 1
searchnamesre(sdf, 'ORGS', regex)

I don't quite understand your logic, you want to update the first df with the matching values from the other df but then you then set the business value for Z to 1, is that correct? It was 0 originally. — EdChum, Jul 15 '14 at 21:56

Jeremy Z · Answer 1 · 2019-10-09T01:10:57.340

Attention: In latest version of pandas, both answers above doesn't work anymore:

KSD's answer will raise error:

df1 = pd.DataFrame([["X",1,1,0],
              ["Y",0,1,0],
              ["Z",0,0,0],
              ["Y",0,0,0]],columns=["Name","Nonprofit","Business", "Education"])    

df2 = pd.DataFrame([["Y",1,1],
              ["Z",1,1]],columns=["Name","Nonprofit", "Education"])   

df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2.loc[df2.Name.isin(df1.Name),['Nonprofit', 'Education']].values

df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2[['Nonprofit', 'Education']].values

Out[851]:
ValueError: shape mismatch: value array of shape (2,) could not be broadcast to indexing result of shape (3,)

and EdChum's answer will give us the wrong result:

 df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2[['Nonprofit', 'Education']]

df1
Out[852]: 
  Name  Nonprofit  Business  Education
0    X        1.0         1        0.0
1    Y        1.0         1        1.0
2    Z        NaN         0        NaN
3    Y        NaN         1        NaN

Well, it will work safely only if values in column 'Name' are unique and are sorted in both data frames.

Here is my answer:

Way 1:

df1 = df1.merge(df2,on='Name',how="left")
df1['Nonprofit_y'] = df1['Nonprofit_y'].fillna(df1['Nonprofit_x'])
df1['Business_y'] = df1['Business_y'].fillna(df1['Business_x'])
df1.drop(["Business_x","Nonprofit_x"],inplace=True,axis=1)
df1.rename(columns={'Business_y':'Business','Nonprofit_y':'Nonprofit'},inplace=True)

Way 2:

df1 = df1.set_index('Name')
df2 = df2.set_index('Name')
df1.update(df2)
df1.reset_index(inplace=True)

More guide about update.. The columns names of both data frames need to set index are not necessary same before 'update'. You could try 'Name1' and 'Name2'. Also, it works even if other unnecessary row in df2, which won't update df1. In other words, df2 doesn't need to be the super set of df1.

Example:

df1 = pd.DataFrame([["X",1,1,0],
              ["Y",0,1,0],
              ["Z",0,0,0],
              ["Y",0,1,0]],columns=["Name1","Nonprofit","Business", "Education"])    

df2 = pd.DataFrame([["Y",1,1],
              ["Z",1,1],
              ['U',1,3]],columns=["Name2","Nonprofit", "Education"])   

df1 = df1.set_index('Name1')
df2 = df2.set_index('Name2')


df1.update(df2)

result:

      Nonprofit  Business  Education
Name1                                
X           1.0         1        0.0
Y           1.0         1        1.0
Z           1.0         0        1.0
Y           1.0         1        1.0

It helped me a lot. The way community members like you come back and put the latest details for the next bunch of knowledge seekers is really very commendable. Thank you so much!! @Jeremy Z — ChandanJha, Aug 30 '19 at 19:49
Thanks @JeremyZ. The **Way 1** was the right one for me! But I don't quite understand the sencond way, the update() method, would update ALL numerical columns of the first DF ? Can you select one specific column to update, with update() ? — Emiliano, Dec 23 '19 at 16:32
@Emiliano If you want to select one specific column, just try df1["Education"].update(df2["Education"]) — Jeremy Z, Dec 24 '19 at 03:34
@Jeremy Z way 2 is not working!ValueError: cannot reindex from a duplicate axis — Zesty Dragon, Aug 17 '20 at 18:11
@ZestyDragon Could you tell me which version of python and pandas you use? I test the way 2 code. It works in pandas 0.24.2 and 1.1.0 and python 3.7.3. — Jeremy Z, Aug 21 '20 at 08:45
@JeremyZ The update with a subset of columns doesn't work. df1["Education"].update(df2["Education"]) doesn't update df1 because df1["Education"] is a copy. The correct way is df1.update(df2["Education"]), only the column "Education" will be update in df1, that's what we want — Nicolas Dufaur, Apr 09 '21 at 11:33
@NicolasDufaur Based on what you mean above,I use the same code example and add 2 more lines code below. What I find is df1.loc["Y","Education"] doesn't change to 2 when df2.loc["Y","Education"] changes to 2: df1["Education"].update(df2["Education"]) df2.loc["Y","Education"] = 2. Also, df1["Education"].update(df2["Education"]) works, since the update of series works in this scenario. For more details, please see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.update.html — Jeremy Z, Apr 12 '21 at 02:40
The solution with the update leads to erroneous results if the index of the two dataframes does not overlap. — Aneho, May 23 '22 at 11:12

EdChum · Accepted Answer · 2017-04-23T20:21:36.953

43

Use the boolean mask from isin to filter the df and assign the desired row values from the rhs df:

In [27]:

df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1[['Nonprofit', 'Education']]
df
Out[27]:
  Name  Nonprofit  Business  Education
0    X          1         1          0
1    Y          1         1          1
2    Z          1         0          1
3    Y          1         1          1

[4 rows x 4 columns]

edited Apr 23 '17 at 20:21

answered Jul 15 '14 at 21:45

EdChum

376,765
198
813
562

1

I have seen this giving me errors. Looks like the answer below is the one that actually works. – vijayshankarv Jan 09 '19 at 18:26
This relies on the indices matching which is what the op wanted, it's a different problem if the index doesn't match, in which case the other answer is more appropriate. However, index alignment is one of pandas' main features so it depends on the use case – EdChum Jan 09 '19 at 18:30
You're right about the index. Apologies for the downvote. – vijayshankarv Jan 22 '19 at 20:54

score 25 · Answer 3 · edited Oct 03 '17 at 19:45

25

In [27]: This is the correct one.

df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1[['Nonprofit', 'Education']].values

df
Out[27]:

Name  Nonprofit  Business  Education

0    X          1         1          0
1    Y          1         1          1
2    Z          1         0          1
3    Y          1         1          1

[4 rows x 4 columns]

The above will work only when all rows in df1 exists in df . In other words df should be super set of df1

Incase if you have some non matching rows to df in df1,you should follow below

In other words df is not superset of df1 :

df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = 
df1.loc[df1.Name.isin(df.Name),['Nonprofit', 'Education']].values

edited Oct 03 '17 at 19:45

Rominus

1,181
2
14
29

answered Oct 03 '17 at 18:05

KSD

251
3
5

This answer worked for me whereas the accepted did not. For the accepted answer, the assignment of values was mismatched after executing the isin line. – demongolem Oct 22 '18 at 18:12
Same here. With the others I was getting the correct entries and NaN's when updating strings. With this it is correct for my update at least. – sectechguy May 08 '20 at 20:32
This worked for me. The .values at the end helped NaN not to come for the last record. The previous two answers didn't work. – Biswajit Ghoshal Sep 21 '20 at 12:10
.values work for me as well – Jose Urena Aug 14 '23 at 23:43

score 7 · Answer 4 · edited Feb 04 '20 at 15:25

7

df2.set_index('Name').combine_first(df1.set_index('Name')).reset_index()

edited Feb 04 '20 at 15:25

Kevin

16,549
8
60
74

answered Feb 04 '20 at 15:00

AVK

75
1
6

Replace column values based on another dataframe python pandas - better way?

4 Answers4

Attention: In latest version of pandas, both answers above doesn't work anymore:

Way 1:

Way 2:

Linked

Related