0

I have a Pandas dataframe that I create from CSV file. For various reasons, I convert the 'purchase_date' column from string to datetime.date object and drop the time - like this:

shoppingData_df[dateCol] = pd.to_datetime(shoppingData_df[dateCol])

# drop time from datetime col.
shoppingData_df[dateCol] = shoppingData_df[dateCol].apply(lambda x: x.date())

Now, I want to do some filtering operations on shoppingData_df - based on start and end dates. I have control on the format of start_date and end_date (could be either string or datetime obj). I want to know how I can filter a dataframe that has datetime column in it. I know I can filter a dataframe when it has string dates column in it like this:

start_date_str = '2016-03-22'
end_date_str = '2016-03-30'

print ("\n\n Using dataframes with string objects.")
mask = (df['purchase_date'] > start_date_str) & (df['purchase_date'] < end_date_str)
dfs = df[mask]
print dfs

NOTE: My dataset is huge (millions of rows). I do not want to make a copy or transform the purchase_date column back to strings..and such just for subsetting. It may take a long time. I also do not want to make the 'purchase_date' column as an index (and then do df[start_date: end_date]).

Question: Is there a way to do subsetting/filtering on a dataframe with datetime object column? Something like:

import datetime as dt


print ("\n\n Using dataframes with datetime objects.")

start_date_str = '2016-03-22'
end_date_str = '2016-03-30'
start_date_obj = dt.datetime.strptime(start_date_str, '%Y-%m-%d')
end_date_obj = dt.datetime.strptime(end_date_str, '%Y-%m-%d')

mask = (df['purchase_date'] > start_date_obj) & (df['purchase_date'] < end_date_obj)
dfs = df[mask]
print dfs

When I run the above snippet, I do not get an error, but an empty df. I am print out the types of the column and the type of the condition-variable (start_date).

print type(df['purchase_date'] -- <class 'pandas.core.series.Series'>
print type(start_date_obj)  -- <type 'datetime.date'>
print type(end_date_obj)  -- <type 'datetime.date'>

Empty DataFrame
Columns: [purchase_date, total_purchase, num_customers, cust_id]
Index: []
AusMBP:~/av/exps/py$

I am expecting some values in the resulting subset df. I am not sure if comparing a Series with a datetime.date object would cause problems.

user1717931
  • 2,419
  • 5
  • 29
  • 40
  • 1
    I'm confused, did you get an error when you tried this? From this [post](http://stackoverflow.com/a/29370182/3453737) it seems like you shouldn't have a problem doing what you're trying to do. – shawnheide Sep 23 '16 at 01:53
  • Everything will work, even strings. When in doubt, try it out! – Kartik Sep 23 '16 at 06:04
  • Updated with the result I am seeing (empty df). @Kartik: I am trying. I think I need more understanding of the datatypes and possibilities. I am under the assumption that comparing a Series with datetime.date object should be OK. The system is not complaining, but, doesn't filter my data. I checked my dataset and I do have data in that range of dates. beats me. – user1717931 Sep 23 '16 at 12:35
  • Found the issue. There was an error in the upstream code where the start and end-dates were 'futuristic'...rather than what was in the dataframe. Apologies for overlooking this. Closing this question. – user1717931 Sep 23 '16 at 13:20

0 Answers0