I have a Pandas dataframe that I create from CSV file. For various reasons, I convert the 'purchase_date' column from string to datetime.date object and drop the time - like this:
shoppingData_df[dateCol] = pd.to_datetime(shoppingData_df[dateCol])
# drop time from datetime col.
shoppingData_df[dateCol] = shoppingData_df[dateCol].apply(lambda x: x.date())
Now, I want to do some filtering operations on shoppingData_df - based on start and end dates. I have control on the format of start_date and end_date (could be either string or datetime obj). I want to know how I can filter a dataframe that has datetime column in it. I know I can filter a dataframe when it has string dates column in it like this:
start_date_str = '2016-03-22'
end_date_str = '2016-03-30'
print ("\n\n Using dataframes with string objects.")
mask = (df['purchase_date'] > start_date_str) & (df['purchase_date'] < end_date_str)
dfs = df[mask]
print dfs
NOTE: My dataset is huge (millions of rows). I do not want to make a copy or transform the purchase_date column back to strings..and such just for subsetting. It may take a long time. I also do not want to make the 'purchase_date' column as an index (and then do df[start_date: end_date]).
Question: Is there a way to do subsetting/filtering on a dataframe with datetime object column? Something like:
import datetime as dt
print ("\n\n Using dataframes with datetime objects.")
start_date_str = '2016-03-22'
end_date_str = '2016-03-30'
start_date_obj = dt.datetime.strptime(start_date_str, '%Y-%m-%d')
end_date_obj = dt.datetime.strptime(end_date_str, '%Y-%m-%d')
mask = (df['purchase_date'] > start_date_obj) & (df['purchase_date'] < end_date_obj)
dfs = df[mask]
print dfs
When I run the above snippet, I do not get an error, but an empty df. I am print out the types of the column and the type of the condition-variable (start_date).
print type(df['purchase_date'] -- <class 'pandas.core.series.Series'>
print type(start_date_obj) -- <type 'datetime.date'>
print type(end_date_obj) -- <type 'datetime.date'>
Empty DataFrame
Columns: [purchase_date, total_purchase, num_customers, cust_id]
Index: []
AusMBP:~/av/exps/py$
I am expecting some values in the resulting subset df. I am not sure if comparing a Series with a datetime.date object would cause problems.