0

I want to get a section of a dataframe that meets a certain requirement.

I want to do:

new_df = old_df[old_df.timevariable.date() == thisdateiwant]

Is there an efficient way to do this that works? the issue here is the .date() part. I've done this before using the same syntax but not with a modifier on the part of old_df. For example of old_df.timevariable is a datetime, then I could match this with a ==datetime but as I want a date, I need to modify each element in the dataframe, which the syntax doesn't like.

I know I could take it all out and have it loop through with a bunch of variables, but I'm pretty sure this would be much slower. The first code snippet seemed to be the fastest way of doing this (like a WHERE SQL clause), although doesn't seem to work if you need to modify the variable you're comparing (such as .date()).

The old_df is about (900k, 15) in size so I want to get something efficient. Currently, I'm just changing variables and reimporting from SQL which seems to take 5-10 seconds for each date (thisdateiwant). I presume something in python with the larger initial database will be quicker than this. Typically it returns about 30k rows into new_df for each date.

What is the fastest way of doing this?

Edit

Happy to mark this as a duplicate, I got it working from some code in that other question (from @Pault). basically did:

mask = old_db['timevariable'] >= thisdateiwant
mask2 =  old_db['timevariable'] < thisdateiwant (+1day)
new_db = old_db.loc[mask]
new_db = new_db.loc[mask2]

I don't think there's an easy way to do both masks at the same time, seemed to throw and error. It's nice and quick so I'm happy.

Sam
  • 23
  • 6
  • You can chain together multiple boolean expressions like this: `new_df = old_df[(old_df.variable == setvariable) & (old_df.timevariable.date() == thisdateiwant)]`. You'll have to use the bitwise comparison operators (ie `&` instead of `and`) – pault Jun 05 '18 at 13:14
  • Possible duplicate of [Select rows from a DataFrame based on values in a column in pandas](https://stackoverflow.com/questions/17071871/select-rows-from-a-dataframe-based-on-values-in-a-column-in-pandas) – pault Jun 05 '18 at 13:15
  • Best thing is to use pandas [query](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html#pandas-dataframe-query) function. – Space Impact Jun 05 '18 at 13:16
  • Your question seems to be asking two different ones: 1) how to filter by date or 2) how to filter across a *bunch of variables*? You could use SQL to query. Please post sample data, more code (not line snippets), and desired results. – Parfait Jun 05 '18 at 13:42
  • Happy to mark this as a duplicate, I got it working from some code in that other question (from Pault). basically did: ' mask = old_db['timevariable'] >= thisdateiwant mask2 = old_db['timevariable'] < thisdateiwant (+1day) new_db = old_db.loc[mask] new_db = new_db.loc[mask2]' – Sam Jun 05 '18 at 14:25

1 Answers1

0

if your column is truly a timestamp then you can make use of the dt accessor.

new_df = old_df[old_df.timevariable.dt.floor('D') == '2018-05-09']

otherwise change the target column to timestamp using pd.to_datetime

old_df[timevariable] = pd.to_datetime(old_df[timevariable])

ranges of dates are supported more naturally without the dt accessor

new_df = old_df[old_df.timevariable >= '2018-05-09'] <- dates after may 9th inclusive
Roelant
  • 4,508
  • 1
  • 32
  • 62
Darren Brien
  • 115
  • 7