0

Let dfList = [df0, df1, df2, df3, df4] be a list of pandas dataframes with the same columns and index.

I would like to average them, but some dataframes are outliers for some rows.

I have a list of non-outliers: myList = [ [0,1] , [1,2,3], [0,2,4] ] showing which dataframes should be averaged for each row.

So,

result.iloc[0,:] = average( df0.iloc[0,:] , df1[0,:] )
result.iloc[1,:] = average( df1.iloc[1,:] , df2.iloc[1,:] , df3.iloc[1,:] )
result.iloc[2,:] = average( df0.iloc[2,:] , df2.iloc[2,:] , df4.iloc[2,:] )

How can I calculate the above described average in terms of dfList and myList?

ThePortakal
  • 229
  • 2
  • 10
  • Does your existing calculation already work, but you are just trying to make the dataframes used in the calculations dynamic? – Kurt Kline May 17 '20 at 13:52
  • Using several for loop's and if's (to iterate over the dataframes and cells) I could produce the desired output, but it is very slow and obviously inefficient. – ThePortakal May 17 '20 at 13:56
  • @ThePortakal Hey! you still facing this problem? Did you get a chance to try out my solution? – Partha Mandal May 18 '20 at 22:47
  • 1
    @ParthaMandal Sorry for the delayed response. Thank you for your kind answer, and your solution worked great. – ThePortakal May 19 '20 at 02:13

1 Answers1

1

This basically screams matrix multiplication! This should work for your case:

import pandas as pd, numpy as np
df1 = pd.DataFrame([[0,1,2],[1,2,4],[1,1,1],[1,1,1]])
df2 = pd.DataFrame([[10,10,10],[1,2,4],[1,1,1],[0,0,0]])
df3 = pd.DataFrame([[1,1,1],[1,2,4],[1,1,1],[0,0,0]])
df_list = [df1,df2,df3]

arr = np.stack(df_list)
print(arr)

lst = [[0,1],[1,2],[0,1,2],[0]]
_max = len(df_list)

mult = []
for _ in lst:
  _lst = []
  for i in range(_max):
    if i in _:
      _lst.append(1)
    else:
      _lst.append(0)
  mult.append(_lst)

mult = np.array(mult)
print(mult)
mult = (mult.T/np.sum(mult,axis=1)).T
print(mult)

result = np.einsum('jik,ij->ik',arr,mult)
print(result)

Partha Mandal
  • 1,391
  • 8
  • 14