I have a dataframe "df" that has many thousands of rows and NaN values in "Waist":
Dataframe "df"
Height Weight Bust Cup Waist Sales Returns
64 130 34 4 NaN 10 1
65 130 34 3 27 11 4
66 130 34 3 26 9 5
65 130 34 3 NaN 3 2
64 130 34 3 28 5 1
64 130 34 3 NaN 6 1
67 129 33 4 26 7 4
68 130 33 3 27 6 3
66 130 33 3 27 4 3
I have a second dataframe "dfwaist" which is smaller in both number of rows and columns. 'dfwaist' only contains Height, Weight, Bust, Cup, Waist, and Count.
I'd like to return the value of "Waist" from 'dfwaist' where the values for Height, Weight, Bust, and Cup match 'df' and replace the NaN values in 'df' with the returned values. If there is no match, leave the NaN in place.
Dataframe 'dfwaist'
Height Weight Bust Cup Waist Count
64 130 34 4 27 79
65 130 34 3 27 76
66 130 34 3 26 72
65 130 34 3 26 63
64 130 34 3 28 55
64 130 34 3 28 46
67 129 33 4 26 47
68 130 33 3 27 36
66 130 33 3 27 24
Desired output for 'df'
Height Weight Bust Cup Waist Sales Returns
64 130 34 4 27 10 1
65 130 34 3 27 11 4
66 130 34 3 26 9 5
65 130 34 3 26 3 2
64 130 34 3 28 5 1
64 130 34 3 28 6 1
67 129 33 4 26 7 4
68 130 33 3 27 6 3
66 130 33 3 27 4 3
Is there a way to return and overwrite NaN values in my first data frame with those found in the second conditionally?
I've already tried .combine_first() and .merge() and both fail because both assume the same shape between the data frames as is suggested in Pandas fill missing values in dataframe from another dataframe
.fillna() also assumes the same shape between the dataframes as is suggested in proper way to replace NaN value from another dataframe on column match in pandas It also requires a common index to be set, but a common index doesn't exist between the two dataframes.
.combine_first() is promising, but it would fill at most one of the rows in "df" and wouldn't respect the combination of Height, Weight, Bust, and Cup required before combining "df" and "dfwaist". This was suggested in how can I replace NaN value with data from another dataframe in python?
replacing values in a pandas dataframe with values from another dataframe based common columns will not preserve the order of the data, and requires the creation of a third dataframe.
Lastly, I've tried disregarding the second data frame entirely and filling NaN values with the .mode() of the Waist column when the combination of Height, Weight, Bust, and Cup is grouped in a .groupby(). The first V1 fails because of a key error: 0. The second fails to run. The third runs, but imputes NaN values over previously "good" values.
# For every Height, Weight, Bust, Cup combination, populate the 'Waist' feature with the mode of the combination.
# V1
#df['Waist'] = df.groupby(['Height', 'Weight', 'Bust', 'Cup','Waist']).Waist.apply(lambda x: x.fillna(x.mode()[]))
# V2
#df['Waist'] = df.groupby(['Height', 'Weight', 'Bust', 'Cup','Waist'])['Waist'].apply(lambda x: x.fillna(x.mode([0])))
# V3
#df['Waist'] = df.groupby(['Height','Weight','Bust','Cup','Waist']).agg(pd.Series.mode).apply(lambda x: x.fillna(x.mode([4])))