0

Let's say I have my main DataFrame.

df = pd.DataFrame({'ID': [1,1,1,2,2,2,3,3,3], 
'Date': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-01', '2021-01-02', '2021-01-03','2021-01-01', '2021-01-02', '2021-01-03'] ,
'Values': [11, np.nan, np.nan, 13, np.nan, np.nan, 15, np.nan, np.nan], 
'Random_Col': [0,0,0,0,0,0,0,0,0]})

I want to fill the np.nan values with values from another dataframe that is not the same shape. The values have to match on "ID" and "Date".

new_df = pd.DataFrame({'ID': [1,1,2,2,3,3], 
'Date': ['2021-01-02', '2021-01-03', '2021-01-02', '2021-01-03','2021-01-02','2021-01-03'],
'Values': [16, 19, 14, 14, 19, 18]})

What's the best way to do this?

I experimented with df.update(), but I'm not that works since the dataframes do not have the same number of rows. Am I wrong about this?

I could also use pd.merge(), but then I end up with multiple versions of each column and have to .fillna() for each specific column with the 2nd column with the new values. This would be fine if I only had 1 column of data to do this for, but I have dozens.

Is there a simpler way that I haven't considered?

Ragnar Lothbrok
  • 1,045
  • 2
  • 16
  • 31
  • If you know each index will match `pandas.Series.combine_first` is a worth a look – Paul H Apr 13 '22 at 22:34
  • not sure if this might be of interest https://stackoverflow.com/questions/29357379/pandas-fill-missing-values-in-dataframe-from-another-dataframe – jspcal Apr 13 '22 at 22:35

1 Answers1

1

One option is to merge + sort_index + bfill to fill the missing data in df, then reindex with df.columns. Since '\x00' has the lowest value, the sorting should place the same column names next to each other.

out = (df.merge(new_df, on=['ID','Date'], how='left', suffixes=('','\x00'))
       .sort_index(axis=1).bfill(axis=1)[df.columns])

Output:

  ID        Date Values Random_Col
0  1  2021-01-01   11.0          0
1  1  2021-01-02   16.0          0
2  1  2021-01-03   19.0          0
3  2  2021-01-01   13.0          0
4  2  2021-01-02   14.0          0
5  2  2021-01-03   14.0          0
6  3  2021-01-01   15.0          0
7  3  2021-01-02   19.0          0
8  3  2021-01-03   18.0          0
  • This is interesting, but this solution only works if all the columns are adjacent to one another, correct? It works for my example, but in a real world exercise with dozens of columns like this, and I have something like columns = ['ID', 'Date', 'col1', 'col2', 'col3', etc], then it would simply impute nulls from 'col1' with data from 'col2', rather than 'col1_', right? – Ragnar Lothbrok Apr 13 '22 at 23:16
  • That does work. Quite clever! Thank you! – Ragnar Lothbrok Apr 13 '22 at 23:41