I recently read a fantastic article (https://towardsdatascience.com/apply-function-to-pandas-dataframe-rows-76df74165ee4) which suggests that vectorizing is much faster than itertuples and was hoping practice. My current code, on 2 million rows, takes about 16 hours to complete on the following sample saved in the pandas DataFrame object "data":
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2274589 entries, 0 to 2274588
Data columns (total 5 columns):
# Column Dtype
--- ------ -----
0 Date object
1 Time object
2 Open float64
3 Close float64
4 datetime datetime64[ns]
print(data)
Date Time Open Close datetime
0 02/10/2012 07:26 191.9500 191.9500 2012-02-10 07:26:00
1 02/10/2012 07:56 191.6600 191.6600 2012-02-10 07:56:00
2 02/10/2012 08:00 191.9400 191.9400 2012-02-10 08:00:00
3 02/10/2012 09:30 191.7500 191.7500 2012-02-10 09:30:00
4 02/10/2012 09:54 191.8500 191.8500 2012-02-10 09:54:00
Working code removes time before 9:30AM and after 3:59PM:
keep=[]
end = data.shape[0]
for row in data.itertuples(index=True):
if (row.datetime < datetime(year = row.datetime.year, month = row.datetime.month, day = row.datetime.day, hour = 9, minute = 30, second = 0)):
pass
elif (row.datetime.hour > 16): # closes at 15:59 (keep it!) in this database's notation
pass
else:
keep.append(row[0])
print(row[0], "/", end)
data = data.loc[keep, :]
Vectorizing is new to me, and I've tried some operations, but I feel like since it's a Series, comparisons or setting values are an issue, since it's not one individual number. From reading, it seems I need to make a function, so that I can do: data['keep_it'] = my_fun(data['datetime'])
Failing attempts:
data['keep_it'] = my_fun(data['datetime'])
def my_fun(row): # returns 1 if desired to keep # a vectorized approach
if (row < datetime.date(year = row.year, month = row.month, day = row.day, hour = 9, minute = 30, second = 0)):
return 1
# AttributeError: 'Series' object has no attribute 'year'
if (row < pd.to_datetime(str(row['datetime'].year) +'/' + str(row['datetime'].month) +'/' + str(row['datetime'].day) + 'T9:30:00')):
return 1
# AttributeError: 'Series' object has no attribute 'year'
Any thoughts? Thanks!