Coalesce (SQL) functionality for Python Pandas

Question

All,

I was able to find a function called "combine_first()" in the pandas documentation as well as stackoverflow. This works great for only a few logical example. I was able to get the below to work combining the "combine_first()" function multiple times (in this case 6). Can someone assist in finding a more elegant solution?

The result of the created variable "category_id" should contain the first non-missing value starting with the last variable (category_id7) and ascending to the first. If category_id(x) is populated, category_id should take that value and stop processing for every row in the dataframe.

d={'category_id1':[32991,32991,32991,32991,32991],
   'category_id2':[22,22,22,22,22],
   'category_id3':[33058,51,121,120,32438],
   'category_id4':[np.nan,np.nan,np.nan,np.nan,np.nan],
   'category_id5':[np.nan,np.nan,np.nan,np.nan,np.nan],
   'category_id6':[np.nan,np.nan,np.nan,np.nan,np.nan],
   'category_id7':[np.nan,np.nan,np.nan,np.nan,np.nan]
  }
df=pd.DataFrame(data=d)
df['category_id']=df.category_id7.combine_first(df.category_id6).combine_first(df.category_id5).combine_first(df.category_id4).combine_first(df.category_id3).combine_first(df.category_id2).combine_first(df.category_id1)
print(df)

score 3 · Accepted Answer · answered Nov 22 '17 at 10:05

You're trying to cascade from the back. So I reverse the order of the columns with iloc. I follow that up with pd.DataFrame.notnull() to identify which cells are not null. When I subsequently run pd.DataFrame.idxmax, I find all the column names for the first non null value in each row, starting from the back. Finally, I use pd.DataFrame.lookup to find the values associated with the found columns.

df.assign(
    category_id=df.iloc[:, ::-1].notnull().idxmax(1).pipe(
        lambda d: df.lookup(d.index, d.values)
    )
)

   category_id1  category_id2  category_id3  category_id4  category_id5  category_id6  category_id7  category_id
0         32991            22         33058           NaN           NaN           NaN           NaN        33058
1         32991            22            51           NaN           NaN           NaN           NaN           51
2         32991            22           121           NaN           NaN           NaN           NaN          121
3         32991            22           120           NaN           NaN           NaN           NaN          120
4         32991            22         32438           NaN           NaN           NaN           NaN        32438

I don't fully understand all the functionality, however it works flawlessly. Much appreciated! — Kyle, Nov 22 '17 at 13:19

Coalesce (SQL) functionality for Python Pandas

1 Answers1