Can I implement the explicit for-loops seen in step 4 using a vectorization approach?

Simple dataset creation:

  1. Declare a DataFrame whose index and columns use the same MultiIndex. The data are also symmetric.
import numpy as np
import pandas as pd

toy_dict={
    'a':[np.nan,3,4,-8,np.inf,np.nan,-8,9],
    'b':[3,np.nan,-3,27,-9,np.nan,9,2],
    'c':[4,-3,np.nan,3,2,-5,-7,3],
    'd':[-8,27,3,np.nan,2,1,-10,12],
    'e':[np.inf,-9,2,2,np.nan,3,7,np.nan],
    'f':[np.nan,np.nan,-5,1,3,np.nan,7,9],
    'g':[-8,9,-7,-10,7,7,np.nan,2],
    'h':[9,2,3,12,np.nan,9,2,np.nan]
}
toy_panda=pd.DataFrame.from_dict(toy_dict)

index_tuple=(
    ('a','a'),
    ('a','b'),
    ('a','c'),
    ('a','d'),
    ('b','a'),
    ('b','b'),
    ('b','c'),
    ('b','d'),
)
my_MultiIndex=pd.MultiIndex.from_tuples(index_tuple)
toy_panda.set_index(my_MultiIndex,inplace=True)
toy_panda.columns=my_MultiIndex

  2. I declare a list of the index lists that I want to subset by.
list_of_indices_lists=[
    [('a','a'),('a','b')],
    [('b','c')],
    [('a','a'),('a','b'),('a','d')],
    [('b','b'),('b','c')]
]

Code that I want to vectorize

  3. I use nested for loops to iterate over my list of subsets, starting the inner loop at the current index of the outer loop.
  4. For each subset, I apply np.select with a list of criteria.
def one_df_aggregate(temp_df):
    '''
    Given a subset of the DataFrame, choose its aggregate value.
    '''
    print(temp_df)
    conditions=[
        np.isnan(temp_df).any(axis=None),
        (temp_df==np.inf).all(axis=None),
        (temp_df==-np.inf).all(axis=None),
        ((temp_df<0).any(axis=None) and (temp_df>0).any(axis=None)),
        (temp_df==0).any(axis=None),
        (temp_df>0).all(axis=None),
        (temp_df<0).all(axis=None)
    ]
    
    choices=[
        np.nan,
        np.inf,
        -np.inf,
        0,
        0,
        temp_df.values.min(),
        temp_df.values.max()
    ]
    # np.select returns the choice paired with the first condition that is True
    return np.select(conditions,choices)


list_of_results=[]
for i in range(len(list_of_indices_lists)):
    for j in range(i,len(list_of_indices_lists)):

        list_of_results.append(
            one_df_aggregate(
                toy_panda.loc[
                    toy_panda.index.isin(list_of_indices_lists[i]),
                    toy_panda.columns.isin(list_of_indices_lists[j])
                ]
            )
        )

Result

Running these for-loops on the example dataset gives the correct result:

[array(nan), array(0.), array(nan), array(nan), array(nan), array(0.), array(nan), array(nan), array(nan), array(nan)]

But it is not vectorized, so I know that it will be slow.

Henry Ecker
rictuar
  • What's the expected output dataframe? – sammywemmy Oct 13 '21 at 21:28
  • The bare minimum is simply a list of the results from the select. Because I know the order in which the test was run (this subset vs that subset, then this subset vs that subset), I can ultimately create something with the form "Outer Loop Selection Index | Inner Loop Selection Index | np.select result" – rictuar Oct 13 '21 at 21:29
  • But, can you give us a black box, 'here's what I'm starting with, and here's the question I'm trying to answer' or 'here is what the final df should look like'? – Nesha25 Oct 13 '21 at 23:20
  • I see. Start with the dataset produced in Step 1 and the list of subsets in Step 2. Then, executing the code in Steps 3/4 yields [array(nan), array(0.), array(nan), array(nan), array(nan), array(0.), array(nan), array(nan), array(nan), array(nan)]. That is a fine result in terms of accuracy, but what I want to know is whether I can express this test without explicit for loops in order to dramatically speed up calculation time. This example is much simpler than my actual dataset. – rictuar Oct 14 '21 at 00:34
  • Hi and welcome on SO. It will be great if you can have a look at [ask] and then try to produce a [mcve]. – rpanai Oct 14 '21 at 00:48
  • @rpanai Ok I think that I have simplified my question. I hope this helps. – rictuar Oct 14 '21 at 01:29
  • Your code is not reproducible. What is `my_MultiIndex`? – Corralien Oct 14 '21 at 07:34
  • ARGH. Somehow I dropped a line in my edits. my_MultiIndex is now declared in Section 1. – rictuar Oct 14 '21 at 16:35

1 Answer

After extensive research, I am leaning toward "No" as the answer. The reason is that this problem essentially involves vectorizing masking/slicing, which is not possible when the results have different dimensions.

Related question: "vectorized indexing/slicing in numpy/scipy?"
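
To make the dimension problem concrete, here is a small sketch. It assumes placeholder data and uses positional indices as stand-ins for the MultiIndex tuples in the question (0 for ('a','a'), 1 for ('a','b'), and so on); `values`, `subsets`, and `shapes` are illustrative names, not part of the original code.

```python
import numpy as np

# Placeholder 8x8 data; only the shapes matter here, not the values.
values = np.arange(64, dtype=float).reshape(8, 8)

# Positional stand-ins for the four subsets of MultiIndex tuples.
subsets = [[0, 1], [6], [0, 1, 3], [5, 6]]

# Shape of every (row-subset i, column-subset j) block with j >= i.
shapes = [
    values[np.ix_(rows, cols)].shape
    for i, rows in enumerate(subsets)
    for cols in subsets[i:]
]
print(shapes)
```

Because the blocks do not share a shape, they cannot be stacked into one regular array and reduced in a single call.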

As an alternative strategy, I'm wondering whether I can set the values that loc would have dropped to some sentinel value that I then ignore.
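
A minimal sketch of that idea, not a tested solution for the real dataset: rather than slicing, build one boolean mask per subset, broadcast the masks into a 4-D "block" mask, and replace the cells outside each block with a neutral value (`inf` for the minimum, `-inf` for the maximum, `False`/`True` for the any/all tests). The function name `aggregate_pairs`, the helpers `any_in`/`all_in`, and the positional subsets are assumptions for illustration. It also assumes the rows and columns share the same ordering, as they do in the question.

```python
import numpy as np

def aggregate_pairs(values, subsets):
    """Aggregate every (row-subset i, column-subset j) block with j >= i.

    `values` is a 2-D array; `subsets` holds positional index lists that
    are used for both rows and columns (an assumption of this sketch).
    """
    values = np.asarray(values, dtype=float)
    n_rows, n_cols = values.shape
    k = len(subsets)

    # Membership masks: row_in[s, r] is True when row r is in subset s.
    row_in = np.zeros((k, n_rows), dtype=bool)
    col_in = np.zeros((k, n_cols), dtype=bool)
    for s, idx in enumerate(subsets):
        row_in[s, idx] = True
        col_in[s, idx] = True

    # block[i, j, r, c]: cell (r, c) is in row-subset i and column-subset j.
    block = row_in[:, None, :, None] & col_in[None, :, None, :]

    def any_in(flag):
        # Cells outside the block count as False.
        return (block & flag).any(axis=(2, 3))

    def all_in(flag):
        # Cells outside the block count as True.
        return (~block | flag).all(axis=(2, 3))

    # Same order of conditions and choices as one_df_aggregate.
    conditions = [
        any_in(np.isnan(values)),
        all_in(values == np.inf),
        all_in(values == -np.inf),
        any_in(values < 0) & any_in(values > 0),
        any_in(values == 0),
        all_in(values > 0),
        all_in(values < 0),
    ]
    choices = [
        np.nan, np.inf, -np.inf, 0.0, 0.0,
        np.where(block, values, np.inf).min(axis=(2, 3)),   # ignore outside cells
        np.where(block, values, -np.inf).max(axis=(2, 3)),  # ignore outside cells
    ]
    full = np.select(conditions, choices)  # shape (k, k)
    # Keep j >= i, in the same order as the nested loops.
    return full[np.triu_indices(k)]

nan, inf = np.nan, np.inf
columns = [
    [nan, 3, 4, -8, inf, nan, -8, 9],    # 'a'
    [3, nan, -3, 27, -9, nan, 9, 2],     # 'b'
    [4, -3, nan, 3, 2, -5, -7, 3],       # 'c'
    [-8, 27, 3, nan, 2, 1, -10, 12],     # 'd'
    [inf, -9, 2, 2, nan, 3, 7, nan],     # 'e'
    [nan, nan, -5, 1, 3, nan, 7, 9],     # 'f'
    [-8, 9, -7, -10, 7, 7, nan, 2],      # 'g'
    [9, 2, 3, 12, nan, 9, 2, nan],       # 'h'
]
values = np.array(columns).T  # same layout as toy_panda.values
subsets = [[0, 1], [6], [0, 1, 3], [5, 6]]  # positions of the tuples
results = aggregate_pairs(values, subsets)
print(results)
```

On the toy data this reproduces the ten values from the loops. The trade-off is memory: the block mask has k × k × rows × columns entries, so for many subsets or a large frame it may need to be processed in chunks.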

rictuar