3

Overview

The code below contains a numpy array clusters with values that are compared against each row of a pandas Dataframe using np.where. The SoFunc function returns rows where all conditions are True and takes the clusters array as input.

Question

I can loop through this array to compare each array element against the respective np.where conditions. How do I remove the requirement to loop but still get the same output?

I appreciate looping though numpy arrays is inefficient and want to improve this code. The actual dataset will be much larger.

Prepare the reproducible mock data

def genMockDataFrame(days,startPrice,colName,startDate,seed=None): 

    periods = days*24
    np.random.seed(seed)
    steps = np.random.normal(loc=0, scale=0.0018, size=periods)
    steps[0]=0
    P = startPrice+np.cumsum(steps)
    P = [round(i,4) for i in P]

    fxDF = pd.DataFrame({ 
        'ticker':np.repeat( [colName], periods ),
        'date':np.tile( pd.date_range(startDate, periods=periods, freq='H'), 1 ),
        'price':(P)})
    fxDF.index = pd.to_datetime(fxDF.date)
    fxDF = fxDF.price.resample('D').ohlc()
    fxDF.columns = [i.title() for i in fxDF.columns]
    return fxDF


def SoFunc(clust):
    #generate mock data
    df = genMockDataFrame(10,1.1904,'eurusd','19/3/2020',seed=157)
    df["Upper_Band"] = 1.1928
    df.loc["2020-03-27", "Upper_Band"] = 1.2118
    df.loc["2020-03-26", "Upper_Band"] = 1.2200
    df["Level"] = np.where((df["High"] >= clust)
                                      & (df["Low"] <= clust)
                                     & (df["High"] >= df["Upper_Band"] ),1,np.NaN
                                      )
    return df.dropna()

Loop through the clusters array

clusters = np.array([1.1929   , 1.2118 ])

l = []

for i in range(len(clusters)):
    l.append(SoFunc(clusters[i]))
    
pd.concat(l)

Output

              Open  High    Low    Close    Upper_Band  Level
date                        
2020-03-19  1.1904  1.1937  1.1832  1.1832  1.1928      1.0
2020-03-25  1.1939  1.1939  1.1864  1.1936  1.1928      1.0
2020-03-27  1.2118  1.2144  1.2039  1.2089  1.2118      1.0
eroot163pi
  • 1,791
  • 1
  • 11
  • 23
nipy
  • 5,138
  • 5
  • 31
  • 72

1 Answers1

2

(Edited based on @tdy's comment below)

pandas.merge allows you to make len(clusters) copies of your dataframe and then pare it down to according to the conditions in your SoFunc function.

The cross merge creates a dataframe with a copy of df for each record in clusters_df. The overall result ought to be faster for large dataframes than the loop-based approach, provided you have enough memory to temporarily accommodate the merged dataframe (if not, the operation may spill over onto page / swap and slow down drastically).

import numpy as np
import pandas as pd

def genMockDataFrame(days,startPrice,colName,startDate,seed=None): 
    ''' identical to the example provided '''

    periods = days*24
    np.random.seed(seed)
    steps = np.random.normal(loc=0, scale=0.0018, size=periods)
    steps[0]=0
    P = startPrice+np.cumsum(steps)
    P = [round(i,4) for i in P]

    fxDF = pd.DataFrame({ 
        'ticker':np.repeat( [colName], periods ),
        'date':np.tile( pd.date_range(startDate, periods=periods, freq='H'), 1 ),
        'price':(P)})
    fxDF.index = pd.to_datetime(fxDF.date)
    fxDF = fxDF.price.resample('D').ohlc()
    fxDF.columns = [i.title() for i in fxDF.columns]
    return fxDF
    
# create the base dataframe according to the former SoFunc
df = genMockDataFrame(10,1.1904,'eurusd','19/3/2020',seed=157)
df["Upper_Band"] = 1.1928
df.loc["2020-03-27"]["Upper_Band"] = 1.2118
df.loc["2020-03-26"]["Upper_Band"] = 1.2200

# create a df out of the cluster array
clusters = np.array([1.1929   , 1.2118 ])
clusters_df = pd.DataFrame({"clust": clusters})

# perform the merge, then filter and finally clean up
result_df = (
    pd
    .merge(df.reset_index(), clusters_df, how="cross") # for each entry in cluster, make a copy of df
    .loc[lambda z: (z.Low <= z.clust) & (z.High >= z.clust) & (z.High >= z.Upper_Band), :] # filter the copies down
    .drop(columns=["clust"]) # not needed in result
    .assign(Level=1.0) # to match your result; not really needed
    .set_index("date") # bring back the old index
)

print(result_df)

I recommend inspecting just the result of pd.merge(df.reset_index(), clusters_df, how="cross") to see how it works.

Salmonstrikes
  • 737
  • 1
  • 6
  • 25
  • 1
    +1, but can avoid creating/dropping the `temp` columns by ["cross" merging](https://stackoverflow.com/a/13270110/13138364): `pd.merge(df.reset_index(), clusters_df, how="cross").loc[...` – tdy Aug 31 '21 at 07:24
  • Ah, it's great that they introduced the cross merge. Thanks for the heads up -- I've updated the answer accordingly. – Salmonstrikes Aug 31 '21 at 08:02