
I'm trying to replace NaN values in df1 with values from df2. The 'replacement' value in df2 is located by two inputs, p/n and operation.

[screenshot of df1 omitted]

[screenshot of df2 omitted]

Because the missing value in df1 depends on both the P/N and the Operation, the lookup in df2 has to use both keys. The NaN rows in df1 can be found with:

df1 = df1.loc[df1['Time'].isnull()]

and the replacement value in df2 can be located with:

df2 = df2.loc[(df2['P/N'] == pn) & (df2['Operation'] == op), 'Time']

The dataframes are too large to merge (plus there is a benefit to keeping them separate), so how can df2 be used as a lookup table for the 'Time' column in df1?

This post is similar, but I had a difficult time understanding the solution, and I'm unsure how to make the lambda depend on two lookup values: Pandas replace all NaN values with data from a different dataframe
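For illustration, this is roughly the kind of two-key lookup I'm after (just a sketch, assuming both frames have 'P/N', 'Operation', and 'Time' columns):

import pandas as pd

# Build a lookup dict keyed on (P/N, Operation) from df2, then fill
# only the NaN 'Time' values in df1 from it.
lookup = df2.set_index(['P/N', 'Operation'])['Time'].to_dict()
keys = pd.Series(list(zip(df1['P/N'], df1['Operation'])), index=df1.index)
df1['Time'] = df1['Time'].fillna(keys.map(lookup))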

Any help is appreciated.

*** UPDATE *** Posting the code used to troubleshoot the issue; special thanks to Naveed for the help!

import numpy as np
import pandas as pd

# Example 'actual' data (df1); 'target' is the column with NaN values to fill
df1 = pd.DataFrame(np.array([
    ['10/25/2022', 'Apple', 'Wash', 'Joe', 'Washer', 'EA', 100, 100, 0, 0.25, 1.1, 90.91, 55],
    ['10/25/2022', 'Apple', 'Peel', 'Tom', 'Peeler', 'EA', 100, 95, 5, 0.25, 1.30, 73.08, pd.NA],
    ['10/25/2022', 'Apple', 'Package', 'Sally', 'Packing', 'EA', 95, 95, 0, 0.8, 5.0, 19.00, pd.NA],
    ['10/25/2022', 'Orange', 'Wash', 'Joe', 'Washer', 'EA', 100, 100, 0, 0.25, 1.1, 90.91, pd.NA],
    ['10/25/2022', 'Orange', 'Peel', 'Tom', 'Peeler', 'EA', 95, 95, 5, 0.25, 2.30, 41.30, pd.NA],
    ['10/25/2022', 'Orange', 'Package', 'Sally', 'Packing', 'EA', 95, 95, 0, 1.1, 5.30, 17.92, pd.NA],
    ['10/25/2022', 'Banana', 'Peel & Eat', 'Tom', 'Peeler', 'EA', 100, 100, 0, 0.25, 1.1, 90.91, pd.NA]]),
    columns=['date', 'fruit', 'operation', 'operator', 'station', 'uom',
             'pcs_processed', 'pcs_accepted', 'pcs_reject', 'setup',
             'runtime', 'actual', 'target'])

# Example 'ideal' data (df2); the lookup table keyed by fruit and operation
df2 = pd.DataFrame(np.array([
    ['Apple', 'Wash', 'EA', 0.25, 0.25, 96],
    ['Apple', 'Peel', 'EA', 0.33, 0.1, 11.63],
    ['Apple', 'Slice', 'EA', 1.25, 0.25, 32],
    ['Apple', 'Package', 'EA', 2.0, 1.0, 16.0],
    ['Orange', 'Wash', 'EA', 0.25, 0.25, 96.0],
    ['Orange', 'Peel', 'EA', 1.0, 0.1, 43.64],
    ['Orange', 'Slice', 'EA', 0.25, 0.25, 96.0],
    ['Orange', 'Package', 'EA', 2.0, 1.0, 16.0],
    ['Banana', 'Peel', 'EA', 0.5, 0.25, 64.0],
    ['Banana', 'Slice', 'EA', 1.10, 0.25, 35.56],
    ['Banana', 'Package', 'EA', 1.50, 1.00, 19.20],
    ['Banana', 'Peel & Eat', 'EA', 2.0, 3.0, 22],
    ['Pear', 'Peel & Eat', 'EA', 1.3, 0.25, 6.5]]),
    columns=['fruit', 'operation', 'uom', 'cycle_time', 'chng_time', 'target'])

# Set a multi-level index on the two lookup keys
df1.set_index(['fruit', 'operation'], inplace=True)
df2.set_index(['fruit', 'operation'], inplace=True)

# Fill the NaN 'target' values in df1 with the matching values from df2
# (overwrite=False only fills missing values; existing values are kept)
df1.update(df2, overwrite=False)

# Reset df index
df1 = df1.reset_index()
df2 = df2.reset_index()
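As a quick check (a sketch only), the previously missing 'target' values in df1 should now be filled from df2, e.g. the Apple/Peel row should show 11.63:

# Inspect the filled column after the update
print(df1[['fruit', 'operation', 'target']])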

RamenRich
  • Please [avoid including your data as an image](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). Pasting it as text, and using code formatting is preferable. – Nick ODell Oct 25 '22 at 21:55
  • `The dataframes are too large to merge` What do you mean by that? Do you run out of memory while merging them? – Nick ODell Oct 25 '22 at 22:00
  • Hi Nick, thanks for the question and comment on images. The dataframes 'being too large' is really a matter of preference. df1 is 'actual' time data, where df2 is 'ideal' time, we're comparing actual to ideal times. In addition to combining actual to ideal data, the combined data is less easy to read. Thanks again. – RamenRich Oct 26 '22 at 02:08
  • @RamenRich, posted an answer, did it work for you? – Naveed Oct 26 '22 at 18:28
  • Hi @Naveed, the solution you posted worked well for my example data. However, when I tried to use it on my actual data, I got an error: ValueError: cannot handle a non-unique multi-index! I've been trying to troubleshoot it today without any luck, any chance you might know what would raise this error? I've tried to change the data type for a few of the columns, but that doesn't appear to be the issue. Thanks for checking back with me. – RamenRich Oct 27 '22 at 01:51
  • @Naveed, it might also be important to mention that the error is raised on df.update; this is where Python has an issue with the multi-index. – RamenRich Oct 27 '22 at 02:03
  • Update matches the two DF on index, which needs to be unique, so that the two DF can map to each other. Is there another key that you can add to the index to make it unique in both DF? Think of it from the standpoint of updating it in a for loop: how would you match the two DF if the indexes are not unique? – Naveed Oct 27 '22 at 02:44
  • @Naveed, this solution, and your recommendation on unique index combinations worked well, thank you! I'd like to post some sample code as a solution, where is the best place, in the 'answer' section below? – RamenRich Oct 27 '22 at 03:37

1 Answer


If you provide the data as code (preferably) or as text, I would be able to share the result as well.

# set the index on the columns you want to match on when updating NaN values
df.set_index(['PN', 'Op'], inplace=True)
df2.set_index(['PN', 'Op'], inplace=True)

# use update to fill the null values in df with the corresponding values from df2
df.update(df2, overwrite=False)

# reset_index returns a new frame, so assign it back
df = df.reset_index()
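As discussed in the comments, update() matches the two frames on the index, so the key combination must be unique in both. A minimal sketch of one way to guard against that, using the same placeholder 'PN'/'Op' column names (run before setting the index):

# update() raises "ValueError: cannot handle a non-unique multi-index!"
# if either frame has duplicate (PN, Op) combinations; one option is to
# keep only the first row per key in the lookup table, or add another
# column to the index so the key combination becomes unique.
df2 = df2.drop_duplicates(subset=['PN', 'Op'], keep='first')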


Naveed