
I have two DataFrames of different sizes, df1 = (578, 81) and df2 = (1500, 59). Every row of df1 exists in df2, and every column of df2 exists in df1. My problem is that I want to update a value in df1 based on six conditions: to update column X, the values in columns X1, X2, Y1, Y2, Z1 and Z2 must be equal in both DataFrames.

In Java I would do something like:

for (int i = 0; i < df1.length; i++) {
    for (int k = 0; k < df2.length; k++) {
        if (df1[i][1] == df2[k][1] && df1[i][2] == df2[k][2] ...) {
            df1[i][0] = df2[k][0];
        }
    }
}
Poojan
    Please see [How to create good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and provide a [mcve] for your issue with sample input and output data, so that we can give more specific help – G. Anderson Jul 29 '19 at 19:16
  • I have no idea how to do that, so I can't provide a reproducible example. I just load the DataFrames from a CSV file; that is as far as I got. – Pablo Pizutti Jul 29 '19 at 19:26
  • Hi Pablo, you could just copy some of the lines out of your csv. – jottbe Jul 29 '19 at 19:29

2 Answers


You can use numpy.where for this, and I think it works well in this case too.

Let's say you have the following DataFrames

import pandas as pd
import numpy as np

df1=pd.DataFrame({'X':[1,3,4,6,5],
                  'X1':[2,3,4,6,3],
                  'Y1':[4,2,1,51,3],
                  'Z1':[2,3,4,1,5]})

df2=pd.DataFrame({'L':[2,3,4,1,4],
                  'X2':[2,3,4,6,5],
                  'Y2':[4,3,4,6,3],
                  'Z2':[2,2,1,51,3]})

Say you want to change the value of X where X1 == X2 & Y1 == Y2 & Z1 == Z2, and the value you want to copy in comes from column L.

You can use numpy.where like this

df1['X']= np.where((df1['X1']==df2['X2']) & (df1['Y1']==df2['Y2']) & (df1['Z1']==df2['Z2']), df2['L'], df1['X'])

Only the first row changes, because that is the only row where all three conditions are satisfied. np.where takes the value from df2['L'] wherever the condition is met and keeps the original df1['X'] value wherever it is not.

Read more about np.where
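For reference, here is the example above run end to end, a minimal sketch using the same frames as in the answer:

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'X':  [1, 3, 4, 6, 5],
                    'X1': [2, 3, 4, 6, 3],
                    'Y1': [4, 2, 1, 51, 3],
                    'Z1': [2, 3, 4, 1, 5]})

df2 = pd.DataFrame({'L':  [2, 3, 4, 1, 4],
                    'X2': [2, 3, 4, 6, 5],
                    'Y2': [4, 3, 4, 6, 3],
                    'Z2': [2, 2, 1, 51, 3]})

# take df2['L'] where all three column pairs agree, else keep df1['X']
df1['X'] = np.where((df1['X1'] == df2['X2'])
                    & (df1['Y1'] == df2['Y2'])
                    & (df1['Z1'] == df2['Z2']),
                    df2['L'], df1['X'])

print(df1['X'].tolist())  # [2, 3, 4, 6, 5] -- only row 0 matched
```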

Update: The DataFrames in the question do not have the same length. They need not have the same columns, but they must have the same number of rows for this element-wise comparison to work. Below is an example in which the two DataFrames have different lengths and how numpy.where is applied in that case.

import pandas as pd
import numpy as np

df1=pd.DataFrame({'X':[1,3,4,6,5],
                  'X1':[2,3,4,6,3],
                  'Y1':[4,3,1,51,3],
                  'Z1':[2,3,4,1,5]})

df2=pd.DataFrame({'L':[2,3,4,1,4,5,1],
                  'X2':[2,3,4,6,5,2,3],
                  'Y2':[4,3,4,6,3,8,7],
                  'Z2':[2,3,1,51,3,9,9],
                  'R2':[2,5,1,2,7,3,9]})

# pad df1 with empty rows so both DataFrames have the same length
# (DataFrame.append was removed in pandas 2.0, so reindex is used instead)
df1 = df1.reindex(range(len(df2)))

df1['X']=np.where((df1['X1']==df2['X2'])&(df1['Y1']==df2['Y2'])&(df1['Z1']==df2['Z2']),df2['L'],df1['X'])

#drop those null values which have been appended above to get back to original
df1=df1.dropna(how='all')    

Kartikeya Sharma
  • I get an error: ValueError: Can only compare identically-labeled Series objects. I got the same error when I tried using a mask. – Pablo Pizutti Jul 29 '19 at 19:46
  • Check the updated answer! The error happened because of unequal rows in the two dataframes. Proposed solution is adding some null values to the dataframe to make it equal and then removing those values. Also you should copy and paste some lines from your excel files if they are not confidential to give us a better idea about what you are looking to achieve. You can also change the values if the data is confidential. – Kartikeya Sharma Jul 29 '19 at 20:25
  • I believe your answer is comparing row by row, and it's not working. I have to search df2 for a match and update X for that match. For example, row 1 of df1 is a match with row 3 of df2; df1 is a filtered df2, with bad rows deleted. – Pablo Pizutti Jul 29 '19 at 21:02

Do the 6 columns make a row unique? If so, I would use merge:

columns_join= ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']
columns_update= ['upd1', 'upd2', 'upd3']

df_merged= df_to_update[columns_join].merge(df_source[columns_join + columns_update], on=columns_join, how='left', indicator='_join_ind')

# the left frame contributes only the join columns, so the update columns
# arrive in df_merged without any suffix; positional masks avoid
# index-alignment problems
matched= (df_merged['_join_ind']=='both').to_numpy()
for col in columns_update:
    df_to_update.loc[matched, col]= df_merged.loc[matched, col].to_numpy()

# now df_to_update contains the result

If the 6 fields do not identify a row uniquely (the merge could produce more rows), you would need to deduplicate the merged result afterwards; but in that case your Java version would also just end up with the values of the last matching row. You can use DataFrame.groupby with e.g. 'last' as the aggregation function.
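To make the merge approach concrete, here is a runnable sketch with made-up toy frames and column names (joining on two columns instead of six; the matching rows appear in a different order in the source frame):

```python
import pandas as pd

# hypothetical frames: df_to_update is a filtered subset of df_source,
# with rows in a different order (names are made up for illustration)
df_to_update = pd.DataFrame({'X': [0, 0],
                             'col1': [1, 2], 'col2': [10, 20]})
df_source = pd.DataFrame({'X': [7, 8, 9],
                          'col1': [2, 3, 1], 'col2': [20, 30, 10]})

columns_join = ['col1', 'col2']
columns_update = ['X']

# left-join df_source onto the join keys of df_to_update
df_merged = df_to_update[columns_join].merge(
    df_source[columns_join + columns_update],
    on=columns_join, how='left', indicator='_join_ind')

# copy values back only where a match was found
matched = (df_merged['_join_ind'] == 'both').to_numpy()
for col in columns_update:
    df_to_update.loc[matched, col] = df_merged.loc[matched, col].to_numpy()

print(df_to_update['X'].tolist())  # [9, 7]
```

Each row of df_to_update picks up the X value of the df_source row with the same (col1, col2) pair, regardless of row order.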

EDIT: in case you need an aggregation, it's probably best to apply it to the DataFrame you take the values from. To aggregate using the last value of each group and merge the result afterwards, replace the .merge line above with:

df_agg= df_source.groupby(columns_join)[columns_update].aggregate('last')
df_merged= df_to_update[columns_join].merge(df_agg, left_on=columns_join, right_index=True, how='left', indicator='_join_ind')
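A minimal sketch of this aggregated variant, again with made-up frames, here with a duplicated join key so that the 'last' aggregation actually matters:

```python
import pandas as pd

# hypothetical frames with a duplicated join key (names are made up)
df_to_update = pd.DataFrame({'upd1': [0], 'col1': [1]})
df_source = pd.DataFrame({'col1': [1, 1], 'upd1': [5, 9]})

columns_join = ['col1']
columns_update = ['upd1']

# keep the last row of each key group, as the nested Java loop would
df_agg = df_source.groupby(columns_join)[columns_update].aggregate('last')

# merge against the aggregated frame's index instead of columns
df_merged = df_to_update[columns_join].merge(
    df_agg, left_on=columns_join, right_index=True,
    how='left', indicator='_join_ind')

matched = (df_merged['_join_ind'] == 'both').to_numpy()
for col in columns_update:
    df_to_update.loc[matched, col] = df_merged.loc[matched, col].to_numpy()

print(df_to_update['upd1'].tolist())  # [9] -- the last value for key 1
```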
jottbe
  • For future use, I think columns_update is missing from on=. I think this is the solution for my case, but I'm getting a KeyError, missing 2 indexes. For the first 10 rows the values are equal. Could this be because a few rows don't have a match? There should be a match for every single row, though. – Pablo Pizutti Jul 29 '19 at 20:23
  • Thank you. You're right; at first I used the list in that place instead of the variable. – jottbe Jul 29 '19 at 20:24
  • So I found the error: the column name had 2 extra spaces in one of the DataFrames. But I still get a missing-key error on the column that I want to update, and I checked, the names are correct. – Pablo Pizutti Jul 29 '19 at 21:59
  • I'm lost. What did you try and what error do you get? Did you additionally apply the aggregation I proposed above? – jottbe Jul 30 '19 at 09:18