6

I have 3 different DataFrames (1 master DataFrame and 2 additional DataFrames). I am trying to add a column to my master DataFrame, with the elements of the column being different cell values in the other two DataFrames. I am using two columns of the master DataFrame to figure out which of the 2 DataFrames I need to get data from, and two more columns to act as indexes to a particular cell in the selected DataFrame.


master_df = pd.DataFrame({
    'col1': ['M', 'F', 'F', 'M'],
    'col2': [0, 1, 2, 3],
    'col3': ['X', 'Z', 'Z', 'X'],
    'col4': [2021, 2022, 2023, 2024]
})

df1 = pd.DataFrame({
    2021: [.632, .214, .987, .555],
    2022: [.602, .232, .287, .552],
    2023: [.932, .209, .347, .725],
    2024: [.123, .234, .9873, .5005]
})

df2 = pd.DataFrame({
    2021: [.6123, .2214, .4987, .555],
    2022: [.6702, .232, .2897, .552],
    2023: [.9372, .2, .37, .725],
    2024: [.23, .24, .873, .005]
})

For each row of the master_df, if the col1 value is 'M' and the col3 value is 'X', I want to choose df1. If the col1 value is 'F' and the col3 value is 'Z', I want to choose df2. Once I have selected the appropriate DataFrame, I want to use col2 of the master_df as a row index and col4 of the master_df as a column index. Finally, I will get the selected cell value and put it into the new column to be added to the master_df.

In this example, master_df should look like this at the end:

master_df = pd.DataFrame({
    'col1': ['M', 'F', 'F', 'M'],
    'col2': [0, 1, 2, 3],
    'col3': ['X', 'Z', 'Z', 'X'],
    'col4': [2021, 2022, 2023, 2024],
    'col5': [.632, .232, .37, .5005]
})

I have tried using a for loop to iterate through the master_df, but it is extremely slow since the DataFrames that I'm working with have millions of rows each. Any efficient pandas solutions for this?

hbdch
  • 103
  • 6

2 Answers2

3

Your master_df has only 2 combinations of value for master_df.col1 and master_df.col3. Therefore, a simple .lookup and np.where will yield your desired output

df1_val = df1.lookup(master_df.col2, master_df.col4)
df2_val = df2.lookup(master_df.col2, master_df.col4)
master_df['col5'] = np.where(master_df.col1.eq('M') & master_df.col3.eq('X'), df1_val, df2_val)

Out[595]:
  col1  col2 col3  col4    col5
0  M    0     X    2021  0.6320
1  F    1     Z    2022  0.2320
2  F    2     Z    2023  0.3700
3  M    3     X    2024  0.5005

Note: if master_df.col1 and master_df.col3 have more than 2 combinations of values, you just need np.select instead of np.where

Andy L.
  • 24,909
  • 4
  • 17
  • 29
2

Here is a solution without using a for loop, I wish it'll work for you

firs we'll make two filter for which dataframe to use

df1_filter = (master_df["col1"] == 'M') & (master_df["col3"] == 'X') 
df2_filter = (master_df["col1"] == 'F') & (master_df["col3"] == 'Z') 

second, for each dataframe, we'll use the appropriate filter to get the values of interest for df1

row1_index = master_df[df1_filter]["col2"]
col1_index = master_df[df1_filter]["col4"]
df1_values_of_interest = df1.iloc[row1_index][col1_index]

for df2

row2_index = master_df[df2_filter]["col2"]
col2_index = master_df[df2_filter]["col4"]
df2_values_of_interest = df2.iloc[row2_index][col2_index]

with this approache,the values of interest are going to be in the diagonal,so we'll try to get them (each one with it's appropriate index) and concatenate them

aa = pd.Series(np.diag(df1_values_of_interest), index=df1_values_of_interest.index)
bb = pd.Series(np.diag(df2_values_of_interest), index=df2_values_of_interest.index)
res = pd.concat([aa, bb])

finally, we'll add the result to the master df

master_df['col5'] = res

I hope the solution is clear,and it'll work for you.if you need more clarification don't hesitate to ask. good luck !