I have following data (18,619,211 rows) stored as a pandas dataframe object in hdf5 file:
date id2 w
id
100010 1980-03-31 10401 0.000839
100010 1980-03-31 10604 0.020140
100010 1980-03-31 12490 0.026149
100010 1980-03-31 13047 0.033560
100010 1980-03-31 13303 0.001657
where id
is index and others are columns. date
is np.datetime64
. I need to perform query like this (the code doesn't work of course):
db=pd.HDFStore('database.h5')
data=db.select('df', where='id==id_i & date>bgdt & date<endt')
Note id_i, bgdt, endt
are all variables, not actual values and need to be passed within a loop. for example:
dates
is a Pandas Period index or timestamps index, either way, I can convert to each other.
dates=['1990-01', 1990-04','1990-09',......]
id_list
is a list of IDs
id_list=[100010, 100011,1000012,.......]
The loop looks like this (the reason I am doing loop is because the data is huge, there are other datasets I have query in the same time and then perform some operations)
db=pd.HDFStore('database.h5')
for id_i in id_list:
for date in dates:
bgdt=date-1 (move to previous month)
endt=date-60 (previous 60 month)
data=db.select('df', where='index==id_i & date>bgdt & date<endt')
......
This problem have 2 parts:
- I don't know how to query index and columns in the same time. The doc in pandas showed how to query based on index conditions OR columns conditions, but no examples on how to query based on them in the SAME TIME.
- (BTW, This is very common in Pandas Documentation. The doc usually shows very simple thing like how to do 'A', OR how to do 'B', but not how to do BOTH 'A' and 'B'. A good example is use
query
on a MultiIndex pandas dataframe. The doc shows based on eitherlevel=0
ORlevel=1
, but no example on how to do BOTH in the SAME TIME.)
- (BTW, This is very common in Pandas Documentation. The doc usually shows very simple thing like how to do 'A', OR how to do 'B', but not how to do BOTH 'A' and 'B'. A good example is use
- I don't know how to pass three variables
id_i, bgdt, endt
to the query. I know how to pass only on by using%s
, but not all of them.- I am also a little confused with the datetime datatype. There seems to be quite a few of datetimes:
datetime.datetime
,numpy.datetime64
,pandas.Period
. I am mostly working on monthly data, sopandas.Period
is the most useful one. But I can't easily convert a column (not index) of timestamps (the default date type of Pandas when parsed from raw data). Is there any datatype that is simply a 'date', not timestamps, not period, but just a simple DATE with only year,month and day?
- I am also a little confused with the datetime datatype. There seems to be quite a few of datetimes:
A lot troubles, but I really LOVE python and pandas (I am trying to move my workflow from SAS to Python). Any help will be appreciated!