Can I implement the explicit for-loops seen in step 4 using a vectorization approach?
Simple dataset creation:
- Declare a DataFrame whose index and columns are the same MultiIndex. The data are also symmetric.
import numpy as np
import pandas as pd
toy_dict={
'a':[np.nan,3,4,-8,np.inf,np.nan,-8,9],
'b':[3,np.nan,-3,27,-9,np.nan,9,2],
'c':[4,-3,np.nan,3,2,-5,-7,3],
'd':[-8,27,3,np.nan,2,1,-10,12],
'e':[np.inf,-9,2,2,np.nan,3,7,np.nan],
'f':[np.nan,np.nan,-5,1,3,np.nan,7,9],
'g':[-8,9,-7,-10,7,7,np.nan,2],
'h':[9,2,3,12,np.nan,9,2,np.nan]
}
toy_panda=pd.DataFrame.from_dict(toy_dict)
index_tuple=(
('a','a'),
('a','b'),
('a','c'),
('a','d'),
('b','a'),
('b','b'),
('b','c'),
('b','d'),
)
my_MultiIndex=pd.MultiIndex.from_tuples(index_tuple)
toy_panda.set_index(my_MultiIndex,inplace=True)
toy_panda.columns=my_MultiIndex
- I declare a list of label lists; each inner list is one subset of index/column labels that I want to select by.
list_of_indices_lists=[
[('a','a'),('a','b')],
[('b','c')],
[('a','a'),('a','b'),('a','d')],
[('b','b'),('b','c')]
]
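To make the subsetting concrete, here is a minimal sketch of how one row subset and one column subset pick out a block. The frame `small` and the names `row_subset`, `col_subset`, `row_mask`, `col_mask` and `block` are hypothetical and only for illustration; the selection itself uses the same `isin` plus `.loc` pattern as the loops further down.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x3 frame, smaller than toy_panda, with the same kind of labels.
labels = pd.MultiIndex.from_tuples([('a', 'a'), ('a', 'b'), ('b', 'c')])
small = pd.DataFrame(np.arange(9.0).reshape(3, 3), index=labels, columns=labels)

row_subset = [('a', 'a'), ('a', 'b')]
col_subset = [('b', 'c')]

# isin returns one boolean per label; .loc keeps the rows/columns where it is True.
row_mask = small.index.isin(row_subset)
col_mask = small.columns.isin(col_subset)
block = small.loc[row_mask, col_mask]
```

Here `block` has two rows and one column, which is the shape of input that the aggregation function receives for one pair of subsets.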
Code that I want to vectorize
- I use nested for-loops to iterate over my list of subsets, starting the inner loop at the current index of the outer loop.
- For each resulting block (rows from one subset, columns from another), I apply np.select with a list of conditions.
def one_df_aggregate(temp_df):
    '''
    Given a DataFrame block, choose its aggregate value.
    '''
    print(temp_df)
    # Conditions are checked in order; np.select returns the first match.
    conditions=[
        np.isnan(temp_df).any(axis=None),
        (temp_df==np.inf).all(axis=None),
        (temp_df==-np.inf).all(axis=None),
        ((temp_df<0).any(axis=None) and (temp_df>0).any(axis=None)),
        (temp_df==0).any(axis=None),
        (temp_df>0).all(axis=None),
        (temp_df<0).all(axis=None)
    ]
    # One choice per condition, in the same order.
    choices=[
        np.nan,
        np.inf,
        -np.inf,
        0,
        0,
        temp_df.values.min(),
        temp_df.values.max()
    ]
    return np.select(conditions,choices)
list_of_results=[]
for i in range(len(list_of_indices_lists)):
    for j in range(i,len(list_of_indices_lists)):
        list_of_results.append(
            one_df_aggregate(
                toy_panda.loc[
                    toy_panda.index.isin(list_of_indices_lists[i]),
                    toy_panda.columns.isin(list_of_indices_lists[j])
                ]
            )
        )
Result
Running these for-loops on the example dataset gives the expected result:
[array(nan), array(0.), array(nan), array(nan), array(nan), array(0.), array(nan), array(nan), array(nan), array(nan)]
But it is not vectorized, so I expect it to be slow on larger data.
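One possible direction, given as a minimal sketch rather than a definitive answer: turn every subset into a boolean row mask and column mask, count the NaN, infinite, positive, negative and zero entries of every block at once with matrix products, and get block minima and maxima with a two-stage masked reduction. The names `pairwise_block_aggregate`, `count` and `block_reduce` are hypothetical. The sketch assumes every block is non-empty and that an intermediate array of shape (number of subsets, rows, columns) fits in memory.

```python
import numpy as np
import pandas as pd

def pairwise_block_aggregate(df, subsets):
    """Aggregate every (i, j >= i) block without a Python loop over pairs."""
    X = df.to_numpy(dtype=float)
    R = np.array([df.index.isin(s) for s in subsets])    # (k, n_rows) row masks
    C = np.array([df.columns.isin(s) for s in subsets])  # (k, n_cols) column masks
    Ri, Ci = R.astype(int), C.astype(int)

    def count(mask):
        # Entry [i, j] = number of True cells of `mask` inside block (i, j).
        return Ri @ mask.astype(int) @ Ci.T

    size = count(np.ones_like(X, dtype=bool))
    n_nan = count(np.isnan(X))
    n_pinf = count(X == np.inf)
    n_ninf = count(X == -np.inf)
    n_pos = count(X > 0)
    n_neg = count(X < 0)
    n_zero = count(X == 0)

    # NaN blocks are overridden below, so NaN can be replaced by any value here.
    filled = np.where(np.isnan(X), 0.0, X)

    def block_reduce(fill, reducer):
        # Stage 1: reduce over the selected columns -> (k, n_rows), indexed [j, row].
        per_row = reducer(np.where(C[:, None, :], filled[None, :, :], fill), axis=2)
        # Stage 2: reduce over the selected rows -> (k, k), indexed [i, j].
        return reducer(np.where(R[:, None, :], per_row[None, :, :], fill), axis=2)

    block_min = block_reduce(np.inf, np.min)
    block_max = block_reduce(-np.inf, np.max)

    # Same order of conditions and choices as one_df_aggregate.
    conditions = [
        n_nan > 0,
        n_pinf == size,
        n_ninf == size,
        (n_neg > 0) & (n_pos > 0),
        n_zero > 0,
        n_pos == size,
        n_neg == size,
    ]
    choices = [np.nan, np.inf, -np.inf, 0.0, 0.0, block_min, block_max]
    result = np.select(conditions, choices, default=0.0)

    # Keep only pairs with j >= i, in the same order as the nested loops.
    i, j = np.triu_indices(len(subsets))
    return result[i, j]
```

Calling `pairwise_block_aggregate(toy_panda, list_of_indices_lists)` returns a single 1-D array instead of a list of 0-d arrays. The list comprehensions that build the masks still loop over subsets, but the work per pair of subsets is done by NumPy.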