Averaging list of pandas dataframes where which rows should be averaged depend on condition

Question

Let dfList = [df0, df1, df2, df3, df4] be a list of pandas dataframes with the same columns and index.

I would like to average them, but some dataframes are outliers for some rows.

I have a list of non-outliers: myList = [ [0,1] , [1,2,3], [0,2,4] ] showing which dataframes should be averaged for each row.

So,

result.iloc[0,:] = average( df0.iloc[0,:] , df1[0,:] )
result.iloc[1,:] = average( df1.iloc[1,:] , df2.iloc[1,:] , df3.iloc[1,:] )
result.iloc[2,:] = average( df0.iloc[2,:] , df2.iloc[2,:] , df4.iloc[2,:] )

How can I calculate the above described average in terms of dfList and myList?

Does your existing calculation already work, but you are just trying to make the dataframes used in the calculations dynamic? — Kurt Kline, May 17 '20 at 13:52
Using several for loop's and if's (to iterate over the dataframes and cells) I could produce the desired output, but it is very slow and obviously inefficient. — ThePortakal, May 17 '20 at 13:56
@ThePortakal Hey! you still facing this problem? Did you get a chance to try out my solution? — Partha Mandal, May 18 '20 at 22:47
@ParthaMandal Sorry for the delayed response. Thank you for your kind answer, and your solution worked great. — ThePortakal, May 19 '20 at 02:13

score 1 · Accepted Answer · answered May 17 '20 at 17:50

This basically screams matrix multiplication! This should work for your case:

import pandas as pd, numpy as np
df1 = pd.DataFrame([[0,1,2],[1,2,4],[1,1,1],[1,1,1]])
df2 = pd.DataFrame([[10,10,10],[1,2,4],[1,1,1],[0,0,0]])
df3 = pd.DataFrame([[1,1,1],[1,2,4],[1,1,1],[0,0,0]])
df_list = [df1,df2,df3]

arr = np.stack(df_list)
print(arr)

lst = [[0,1],[1,2],[0,1,2],[0]]
_max = len(df_list)

mult = []
for _ in lst:
  _lst = []
  for i in range(_max):
    if i in _:
      _lst.append(1)
    else:
      _lst.append(0)
  mult.append(_lst)

mult = np.array(mult)
print(mult)
mult = (mult.T/np.sum(mult,axis=1)).T
print(mult)

result = np.einsum('jik,ij->ik',arr,mult)
print(result)

Averaging list of pandas dataframes where which rows should be averaged depend on condition

1 Answers1