I have a rather large pandas dataframe (~1.5M rows, but only 6 columns) containing regularly structured data: one "Date" dimension (roughly 2200 dates), one "Tenor" dimension (9 tenors) and one "id" dimension (roughly 75 ids), plus 3 "value" columns. Missing 3-tuples are filled with NaNs, so the ~1.5M rows come exactly from 2200 x 75 x 9. Selecting the first date and the first 2 ids, it looks like the following:
Tenor Date id v1 v2 v3
0 1W 2008-03-27 id1 NaN NaN NaN
1 1M 2008-03-27 id1 0.2546 1.1022 13.5355
2 2M 2008-03-27 id1 0.4634 1.4812 13.4620
3 3M 2008-03-27 id1 0.6346 1.7242 13.4069
4 6M 2008-03-27 id1 1.0010 2.1278 13.3204
5 9M 2008-03-27 id1 NaN NaN NaN
6 12M 2008-03-27 id1 1.2576 2.3298 13.3668
7 18M 2008-03-27 id1 NaN NaN NaN
8 24M 2008-03-27 id1 1.3658 2.6527 13.6518
9 1W 2008-03-27 id2 NaN NaN NaN
10 1M 2008-03-27 id2 0.8800 -1.0000 12.7500
11 2M 2008-03-27 id2 1.0340 -1.0000 12.7500
12 3M 2008-03-27 id2 1.2050 -1.0000 12.7500
13 6M 2008-03-27 id2 1.4670 -1.0000 12.7500
14 9M 2008-03-27 id2 NaN NaN NaN
15 12M 2008-03-27 id2 1.7230 -1.0000 12.7500
16 18M 2008-03-27 id2 NaN NaN NaN
17 24M 2008-03-27 id2 1.7075 -1.6000 12.7500
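For anyone who wants to experiment with the access patterns below, a dataframe of the same shape can be generated synthetically; the sizes mirror the description above, but the values are random placeholders rather than my real data:

import numpy as np
import pandas as pd

dates = pd.date_range('2008-03-27', periods=2200, freq='B')          # ~2200 dates
ids = [f'id{i}' for i in range(1, 76)]                                # 75 ids
tenors = ['1W', '1M', '2M', '3M', '6M', '9M', '12M', '18M', '24M']    # 9 tenors

# full Cartesian product, so every (Date, id, Tenor) triple appears exactly once
idx = pd.MultiIndex.from_product([dates, ids, tenors], names=['Date', 'id', 'Tenor'])
df = pd.DataFrame(np.random.randn(len(idx), 3), index=idx, columns=['v1', 'v2', 'v3']).reset_index()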
I need to access each of the ~165k sub-dataframes of length 9 many times, in an efficient way (by picking exactly one date and one id). Standard slicing, as suggested e.g. in Splitting dataframe into multiple dataframes, is not going to cut it: each slice has to scan the whole dataframe, which means linear access time per lookup and O(n^2) work if I want to visit every sub-dataframe. Isolating one 9-row sub-dataframe this way, e.g.:
import numpy as np
from dateutil import parser

d = parser.parse('20150101')
%timeit df[np.logical_and(df.id == 'id1', df.Date == d)]
takes around 120 ms, which is unacceptable in my situation.
After sorting the dataframe by the three relevant columns and resetting the index (df = df.sort_values(['Date','id','Tenor']).reset_index(drop=True)), I can exploit its high regularity to compute the starting row of every sub-dataframe directly: with this ordering, each date occupies len(ids)*9 consecutive rows, and each id occupies 9 rows within its date block:
ids = pd.unique(df.id)        # id order after sorting
dates = pd.unique(df.Date)    # date order after sorting

starting = {
    p: {d: j * 9 * len(ids) + i * 9 for j, d in enumerate(pd.to_datetime(dates))}
    for i, p in enumerate(ids)
}
Then I can access a sub-dataframe by slicing, which avoids scanning the whole dataframe and improves efficiency significantly:
s = starting['id1'][parser.parse('20150101')]
%timeit df.loc[s:s+8]
This runs in about 200 µs, some 600 times faster than the boolean mask.
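As an optional sanity check, the precomputed offsets can be verified to land on the intended block (assuming the sorted-and-reindexed df from above):

d = pd.Timestamp('2015-01-01')
block = df.loc[starting['id1'][d]:starting['id1'][d] + 8]
assert len(block) == 9                                     # exactly one 9-tenor block
assert (block.id == 'id1').all() and (block.Date == d).all()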
However, this is far from the best one can do. If I try to bypass pandas as much as possible, I can split the dataframe into ~165k tiny dataframes, put them in a dictionary, and access them easily:
dfs = {}
for p in ids:
    dfs[p] = {d: df.loc[starting[p][d]:starting[p][d] + 8, :]
              for d in pd.to_datetime(dates)}
This doubles (at least) the memory used, but
d = parser.parse('20150101')
%timeit dfs['id1'][d]
gives a much nicer result of around 1 µs, another ~200x speedup over the slicing above, and probably as fast as it can get (a bare %timeit starting['id1'][d] already takes around 1 µs).
However, this feels wrong, not only because of the additional memory use, but especially because it requires creating a lot of dataframes by hand, which behind the scenes calls malloc some 165k times. This is a huge waste, since we know in advance how much memory we need (indeed, we already have it allocated!). Still, it is the best option in my case, as the pre-computation time is dwarfed by the number of times I will need to access these sub-dataframes. Moreover, the approach generalizes to uneven splits, in less ideal situations where e.g. the number of tenors depends on the id or the date (a sketch of that follows below). Even so, this looks like something pandas should be able to do for me more efficiently than I can.
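A minimal sketch of that generalization, assuming one only needs row positions rather than pre-built dataframes: compute a (start, stop) pair per (id, Date) group once on the sorted dataframe, so the group size is free to vary:

# one pass over the groups; GroupBy.indices gives integer row positions per (id, Date)
bounds = {}
for key, pos in df.groupby(['id', 'Date'], sort=False).indices.items():
    bounds[key] = (pos[0], pos[-1] + 1)   # contiguous because df is sorted by Date, id, Tenor

start, stop = bounds[('id1', pd.Timestamp('2015-01-01'))]
sub = df.iloc[start:stop]                 # positional slice, no scan of the full dataframe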
To sum up, the questions would be:
Is there a pandas function that splits a dataframe into a dictionary whose keys are the values of some column(s) and whose values are slices of the original dataframe?
Is there a reason why cutting the dataframe beforehand is 200 times faster than directly accessing it using the indices? Since the indices I'm using are just integers from 0 to len(df), I'd expect it to be as fast as accessing any Python data structure, yet dictionaries seem to win hands-down.
Otherwise, and more generally, is there a faster way to access large dataframes that I'm missing, in cases where we already know which rows we need?