-1

I have a dataframe that contains some missing values where the 'new_id' is an empty string. I have another dataframe that contains the values it should have, but does not have all the columns in the original dataframe so I can't just replace those rows by index. My original dataframe looks like:

df = pd.DataFrame({
    "id": ["1", "2", "3", "4", "5"],
    "new_id": ["", "", "23", "", "52"],
    "color": ["blue", "red", "green", "yellow", "green"],
    "age": [23, 11, 17, 13, 51],
    "trade": ["", "", "C", "", "B"],
    "color2": ["red", "yellow", "red", "blue", "purple"],
    "fruit": ["", "", "orange", "", "grape"]
})
id   new_id    color    age   trade    color2   fruit
1               blue    23              red     
2               red     11             yellow
3      23       green   17     C        red      orange
4               yellow  13              blue
5      52       green   51     B        purple   grape

The table with the data I need is:

df_map = pd.DataFrame({
    "id": ["1", "2", "4"],
    "new_id": ["", "", ""],
    "trade": ["B", "C", "A"],
    "fruit": ["apple", "orange", "apple"]
})
id   new_id    trade   fruit
1              B       apple
2              C       orange
4              A       apple

Desired output:

id   new_id    color    age   trade    color2   fruit
1               blue    23     B       red      apple
2               red     11     C       yellow   orange
3      23       green   17     C       red      orange
4               yellow  13     A       blue     apple
5      52       green   51     B       purple   grape

How can I combine the information in the two data frames to get my full dataset, and only replace values where 'new_id' is an empty string.

Mark Rotteveel
  • 100,966
  • 191
  • 140
  • 197
mjoy
  • 606
  • 2
  • 9
  • 19

2 Answers2

0

IIUC, you can first join both data frames on id via pd.DataFrame.join or pd.merge to create a temporary data frame containing your fruit and trade columns from both data frames. Then, you can just apply a pandas coalesce version using bfill (or combine_first) and assign the coalesce to you initial data frame.

See here for more details on the pandas coalesce.

Code:

import pandas as pd

# Define data frames
df = pd.DataFrame({
    "id": ["1", "2", "3", "4", "5"],
    "new_id": ["", "", "23", "", "52"],
    "color": ["blue", "red", "green", "yellow", "green"],
    "age": [23, 11, 17, 13, 51],
    "trade": ["", "", "C", "", "B"],
    "color2": ["red", "yellow", "red", "blue", "purple"],
    "fruit": ["", "", "orange", "", "grape"]
})

df_map = pd.DataFrame({
    "id": ["1", "2", "4"],
    "new_id": ["", "", ""],
    "trade": ["B", "C", "A"],
    "fruit": ["apple", "orange", "apple"]
})

# Join both data frames by setting index 
df_temp = (
    df.set_index(["id"])[["trade", "fruit"]]
    .join(
        df_map.set_index(["id"])
        .drop(columns=["new_id"])
        .rename(columns=lambda x: "temp_"+x)
        )
    .reset_index(drop=True)
)

# Apply coalesce
df["fruit"], df["trade"] = (
    df_temp[["temp_fruit", "fruit"]].bfill(axis=1).iloc[:, 0], 
    df_temp[["temp_trade", "trade"]].bfill(axis=1).iloc[:, 0]
)

Output:

id  new_id  color   age trade  color2   fruit
0   1       blue    23  B      red      apple
1   2       red     11  C      yellow   orange
2   3   23  green   17  C      red      orange
3   4       yellow  13  A      blue     apple
4   5   52  green   51  B      purple   grape
ko3
  • 1,757
  • 5
  • 13
  • how would I make this work if I have a large number of columns should be in the apply coalesce part – mjoy Aug 02 '22 at 20:23
  • @mjoy, just use `df_temp[["col1", "col2", ..., "colN"]].bfill(axis=1).iloc[:, 0]` – ko3 Aug 03 '22 at 05:50
0

I found a pretty simple way to do it using loc.

#loop over the df with the data that I need
for i in range(len(df_map)):

    #get the id value 
    map_id = df_map.id[i]

    #get the index that corresponds to the id in original dataframe
    ind = df.index[df['id'] == map_id].tolist()[0]

    #replace the values in the columns that correspond to the values in map_df
    df.loc[ind, list(df_map)] = df_map.iloc[i]
mjoy
  • 606
  • 2
  • 9
  • 19
  • I am not sure how many rows your data frame has, but in general iterating over your whole data is very ineffective and should be avoided (unless the operation is quite complex). If you have a small number of observations, this approach may be sufficiently fast, but for a larger frame it would be quite slow. – ko3 Aug 02 '22 at 14:55