1

I recently read a fantastic article (https://towardsdatascience.com/apply-function-to-pandas-dataframe-rows-76df74165ee4) which suggests that vectorizing is much faster than itertuples and was hoping practice. My current code, on 2 million rows, takes about 16 hours to complete on the following sample saved in the pandas DataFrame object "data":

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2274589 entries, 0 to 2274588
Data columns (total 5 columns):
 #   Column    Dtype         
---  ------    -----         
 0   Date      object        
 1   Time      object        
 2   Open      float64       
 3   Close     float64       
 4   datetime  datetime64[ns]
print(data)
               Date   Time      Open     Close            datetime
0        02/10/2012  07:26  191.9500  191.9500 2012-02-10 07:26:00
1        02/10/2012  07:56  191.6600  191.6600 2012-02-10 07:56:00
2        02/10/2012  08:00  191.9400  191.9400 2012-02-10 08:00:00
3        02/10/2012  09:30  191.7500  191.7500 2012-02-10 09:30:00
4        02/10/2012  09:54  191.8500  191.8500 2012-02-10 09:54:00

Working code removes time before 9:30AM and after 3:59PM:

keep=[]
end = data.shape[0]
for row in data.itertuples(index=True):
    if (row.datetime < datetime(year = row.datetime.year, month = row.datetime.month, day = row.datetime.day, hour = 9, minute = 30, second = 0)):
        pass
    elif (row.datetime.hour > 16): # closes at 15:59 (keep it!) in this database's notation
        pass
    else:
        keep.append(row[0])
    print(row[0], "/", end)
data = data.loc[keep, :] 

Vectorizing is new to me, and I've tried some operations, but I feel like since it's a Series, comparisons or setting values are an issue, since it's not one individual number. From reading, it seems I need to make a function, so that I can do: data['keep_it'] = my_fun(data['datetime'])

Failing attempts:

data['keep_it'] = my_fun(data['datetime'])
def my_fun(row): # returns 1 if desired to keep  # a vectorized approach
    if (row < datetime.date(year = row.year, month = row.month, day = row.day, hour = 9, minute = 30, second = 0)):
        return 1
     # AttributeError: 'Series' object has no attribute 'year'
    if (row < pd.to_datetime(str(row['datetime'].year) +'/' + str(row['datetime'].month) +'/' + str(row['datetime'].day) + 'T9:30:00')):
        return 1
    # AttributeError: 'Series' object has no attribute 'year'

Any thoughts? Thanks!

OatMeal
  • 45
  • 6

2 Answers2

1

This is vectorised.

import datetime as dt

df = pd.read_csv(io.StringIO("""    Date   Time      Open     Close            datetime
0        02/10/2012  07:26  191.9500  191.9500  2012-02-10 07:26:00
1        02/10/2012  07:56  191.6600  191.6600  2012-02-10 07:56:00
2        02/10/2012  08:00  191.9400  191.9400  2012-02-10 08:00:00
3        02/10/2012  09:30  191.7500  191.7500  2012-02-10 09:30:00
4        02/10/2012  09:54  191.8500  191.8500  2012-02-10 09:54:00"""), sep="\s\s+", engine="python")

df["datetime"] = pd.to_datetime(df["datetime"])
df.loc[df["datetime"].dt.time.between(dt.time(9,30),dt.time(15,59))]

Rob Raymond
  • 29,118
  • 3
  • 14
  • 30
1

Thanks to @MrFuppes, I have devised this crude method which is essentially instantaneous:

testing = pd.DatetimeIndex(data['datetime'])
data = data[(testing.hour<16) & (testing.hour*60+testing.minute >= 9*60+30)] 

Room for improvement includes removing the testing with a one liner, and perhaps properly utilizing the DateTimeIndex .time attribute

OatMeal
  • 45
  • 6