Remove the requirement to loop through numpy array

Question

Overview

The code below contains a numpy array clusters with values that are compared against each row of a pandas Dataframe using np.where. The SoFunc function returns rows where all conditions are True and takes the clusters array as input.

Question

I can loop through this array to compare each array element against the respective np.where conditions. How do I remove the requirement to loop but still get the same output?

I appreciate looping though numpy arrays is inefficient and want to improve this code. The actual dataset will be much larger.

Prepare the reproducible mock data

def genMockDataFrame(days,startPrice,colName,startDate,seed=None): 

    periods = days*24
    np.random.seed(seed)
    steps = np.random.normal(loc=0, scale=0.0018, size=periods)
    steps[0]=0
    P = startPrice+np.cumsum(steps)
    P = [round(i,4) for i in P]

    fxDF = pd.DataFrame({ 
        'ticker':np.repeat( [colName], periods ),
        'date':np.tile( pd.date_range(startDate, periods=periods, freq='H'), 1 ),
        'price':(P)})
    fxDF.index = pd.to_datetime(fxDF.date)
    fxDF = fxDF.price.resample('D').ohlc()
    fxDF.columns = [i.title() for i in fxDF.columns]
    return fxDF


def SoFunc(clust):
    #generate mock data
    df = genMockDataFrame(10,1.1904,'eurusd','19/3/2020',seed=157)
    df["Upper_Band"] = 1.1928
    df.loc["2020-03-27", "Upper_Band"] = 1.2118
    df.loc["2020-03-26", "Upper_Band"] = 1.2200
    df["Level"] = np.where((df["High"] >= clust)
                                      & (df["Low"] <= clust)
                                     & (df["High"] >= df["Upper_Band"] ),1,np.NaN
                                      )
    return df.dropna()

Loop through the clusters array

clusters = np.array([1.1929   , 1.2118 ])

l = []

for i in range(len(clusters)):
    l.append(SoFunc(clusters[i]))
    
pd.concat(l)

Output

              Open  High    Low    Close    Upper_Band  Level
date                        
2020-03-19  1.1904  1.1937  1.1832  1.1832  1.1928      1.0
2020-03-25  1.1939  1.1939  1.1864  1.1936  1.1928      1.0
2020-03-27  1.2118  1.2144  1.2039  1.2089  1.2118      1.0

I can not reproduce it, can you give minimal reproducible code? — eroot163pi, Aug 28 '21 at 14:13
You should be able to copy and paste and run this as is. It's data generated with numpy and pandas. What error are you getting? I'm away from my laptop but will check your errors when home. — nipy, Aug 28 '21 at 14:20
Replace lines with `df.loc["2020-03-27", "Upper_Band"] = 1.2118` — eroot163pi, Aug 28 '21 at 14:26

Salmonstrikes · Accepted Answer · 2021-08-31T07:57:26.550

(Edited based on @tdy's comment below)

pandas.merge allows you to make len(clusters) copies of your dataframe and then pare it down to according to the conditions in your SoFunc function.

The cross merge creates a dataframe with a copy of df for each record in clusters_df. The overall result ought to be faster for large dataframes than the loop-based approach, provided you have enough memory to temporarily accommodate the merged dataframe (if not, the operation may spill over onto page / swap and slow down drastically).

import numpy as np
import pandas as pd

def genMockDataFrame(days,startPrice,colName,startDate,seed=None): 
    ''' identical to the example provided '''

    periods = days*24
    np.random.seed(seed)
    steps = np.random.normal(loc=0, scale=0.0018, size=periods)
    steps[0]=0
    P = startPrice+np.cumsum(steps)
    P = [round(i,4) for i in P]

    fxDF = pd.DataFrame({ 
        'ticker':np.repeat( [colName], periods ),
        'date':np.tile( pd.date_range(startDate, periods=periods, freq='H'), 1 ),
        'price':(P)})
    fxDF.index = pd.to_datetime(fxDF.date)
    fxDF = fxDF.price.resample('D').ohlc()
    fxDF.columns = [i.title() for i in fxDF.columns]
    return fxDF
    
# create the base dataframe according to the former SoFunc
df = genMockDataFrame(10,1.1904,'eurusd','19/3/2020',seed=157)
df["Upper_Band"] = 1.1928
df.loc["2020-03-27"]["Upper_Band"] = 1.2118
df.loc["2020-03-26"]["Upper_Band"] = 1.2200

# create a df out of the cluster array
clusters = np.array([1.1929   , 1.2118 ])
clusters_df = pd.DataFrame({"clust": clusters})

# perform the merge, then filter and finally clean up
result_df = (
    pd
    .merge(df.reset_index(), clusters_df, how="cross") # for each entry in cluster, make a copy of df
    .loc[lambda z: (z.Low <= z.clust) & (z.High >= z.clust) & (z.High >= z.Upper_Band), :] # filter the copies down
    .drop(columns=["clust"]) # not needed in result
    .assign(Level=1.0) # to match your result; not really needed
    .set_index("date") # bring back the old index
)

print(result_df)

I recommend inspecting just the result of pd.merge(df.reset_index(), clusters_df, how="cross") to see how it works.

+1, but can avoid creating/dropping the `temp` columns by ["cross" merging](https://stackoverflow.com/a/13270110/13138364): `pd.merge(df.reset_index(), clusters_df, how="cross").loc[...` — tdy, Aug 31 '21 at 07:24
Ah, it's great that they introduced the cross merge. Thanks for the heads up -- I've updated the answer accordingly. — Salmonstrikes, Aug 31 '21 at 08:02

Remove the requirement to loop through numpy array

1 Answers1