0

I have a simple Pandas DataFrame containing columns 'valid_time' and 'value'. The frequency of the sampling is roughly hourly, but irregular and with some large gaps. I want to be able to efficiently pull out all rows for a given day (i.e. within a calender day). How can I do this using DataFrame.where() or something else?

I naively want to do something like this (which obviously doesn't work):

dt = datetime.datetime(<someday>)
rows = data.where( data['valid_time'].year == dt.year and 
                    data['valid_time'].day == dt.day and 
                    data['valid_time'].month == dt.month)

There's at least a few problems with the above code. I am new to pandas so am fumbling with something that is probably straightforward.

Alvaro Fuentes
  • 16,937
  • 4
  • 56
  • 68
Bogdanovist
  • 1,498
  • 2
  • 11
  • 20

2 Answers2

2

Pandas is absolutely terrific for things like this. I would recommend making your datetime field your index as can be seen here. If you give a little bit more information about the structure of your dataframe, I would be happy to include more detailed directions.

Then, you can easily grab all rows from a date using df['1-12-2014'] which would grab everything from Jan 12, 2014. You can edit that to get everything from January by using df[1-2014]. If you want to grab data from a range of dates and/or times, you can do something like:

df['1-2014':'2-2014']

Pandas is pretty powerful, especially for time-indexed data.

David Hagan
  • 1,156
  • 12
  • 23
  • Thanks. My dataframe was made simply by converting a numpy 2D array as I had trouble working out the syntax for making a 'blank' DataFrame that I filled with the required info. I'm familiar with numpy so that seemed the quickest solution. I will look at setting the time field as the index though, as that functionality you demonstrate looks like exactly what I need for this project. – Bogdanovist Feb 02 '14 at 23:49
  • From what is your 2D array being generated from? Read in from file? Pulled from DB? Calculations? – David Hagan Feb 03 '14 at 12:18
  • Calculations (mainly), I'm now initialising like this: (for example) hourly_pred = pd.DataFrame({'T':np.zeros(len(fpred)),'W':np.zeros(len(fpred))}, index=fpred.index) – Bogdanovist Feb 04 '14 at 00:53
  • You should be able to use index=pd.to_datetime(fpred.index) – David Hagan Feb 04 '14 at 12:36
0

Try this (is just like the continuation of your idea):

import pandas as pd
import numpy.random as rd
import datetime

times =  pd.date_range('2014/01/01','2014/01/6',freq='H')
values = rd.random_integers(0,10,times.size)

data = pd.DataFrame({'valid_time':times, 'values': values})
dt = datetime.datetime(2014,1,3)
rows = data['valid_time'].apply(
    lambda x: x.year == dt.year and x.month==dt.month and x.day== dt.day 
    )

print data[rows]
Alvaro Fuentes
  • 16,937
  • 4
  • 56
  • 68