
I have a rather large pandas dataframe (~1.5M rows, but only 6 columns) that contains similarly structured data: one "Date" dimension (roughly 2200 dates), one "Tenor" dimension (9 tenors) and one "id" dimension (roughly 75 ids), plus 3 "value" columns. Missing 3-tuples are filled with NaNs, so that the ~1.5M rows come exactly from 2200 x 75 x 9. To exemplify, selecting the first date and the first 2 ids gives the following:

    Tenor        Date   id      v1      v2       v3
0      1W  2008-03-27  id1     NaN     NaN      NaN
1      1M  2008-03-27  id1  0.2546  1.1022  13.5355
2      2M  2008-03-27  id1  0.4634  1.4812  13.4620
3      3M  2008-03-27  id1  0.6346  1.7242  13.4069
4      6M  2008-03-27  id1  1.0010  2.1278  13.3204
5      9M  2008-03-27  id1     NaN     NaN      NaN
6     12M  2008-03-27  id1  1.2576  2.3298  13.3668
7     18M  2008-03-27  id1     NaN     NaN      NaN
8     24M  2008-03-27  id1  1.3658  2.6527  13.6518
9      1W  2008-03-27  id2     NaN     NaN      NaN
10     1M  2008-03-27  id2  0.8800 -1.0000  12.7500
11     2M  2008-03-27  id2  1.0340 -1.0000  12.7500
12     3M  2008-03-27  id2  1.2050 -1.0000  12.7500
13     6M  2008-03-27  id2  1.4670 -1.0000  12.7500
14     9M  2008-03-27  id2     NaN     NaN      NaN
15    12M  2008-03-27  id2  1.7230 -1.0000  12.7500
16    18M  2008-03-27  id2     NaN     NaN      NaN
17    24M  2008-03-27  id2  1.7075 -1.6000  12.7500

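For anyone who wants to reproduce the timings below, here is a minimal sketch that builds a synthetic dataframe of the same shape (the dates, ids and values here are made up; only the 2200 x 75 x 9 structure matters, and the column order differs slightly from my real data):

import numpy as np
import pandas as pd

# 2200 dates x 75 ids x 9 tenors = 1,485,000 rows, same layout as the real data
tenors = ['1W', '1M', '2M', '3M', '6M', '9M', '12M', '18M', '24M']
dates = pd.date_range('2008-03-27', periods=2200, freq='B')
ids = ['id%d' % i for i in range(1, 76)]

# Cartesian product of the three dimensions, one row per (Date, id, Tenor) triple
idx = pd.MultiIndex.from_product([dates, ids, tenors], names=['Date', 'id', 'Tenor'])
df = pd.DataFrame(index=idx).reset_index()
vals = np.random.randn(len(df), 3)
df['v1'], df['v2'], df['v3'] = vals[:, 0], vals[:, 1], vals[:, 2]
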
I need to access each of the ~165k sub-dataframes of length 9 many times in an efficient way (by picking exactly one date and one id). Standard slicing, as suggested e.g. in Splitting dataframe into multiple dataframes, is not going to cut it: every slice has to scan the whole dataframe, which typically means linear access time per lookup and O(n^2) work if I want to go through all sub-dataframes. Isolating a 9x3 sub-dataframe this way, e.g.:

import numpy as np
from dateutil import parser

d = parser.parse('20150101')
%timeit df[np.logical_and(df.id == 'id1', df.Date == d)]

takes around 120 ms, which is unacceptable in my situation.

After sorting the dataframe by the three relevant columns and resetting the index (df = df.sort_values(['Date','id','Tenor']).reset_index(drop=True), so that the index runs 0, 1, ..., len(df)-1 in the sorted order), I can exploit the high regularity of the dataframe and precompute where each 9-row block starts:

ids = pd.unique(df.id)
dates = pd.unique(df.Date)
# Row where the block for (date j, id i) starts in the sorted dataframe:
# each date spans 9*len(ids) rows, and each id within a date spans 9 rows
starting = {
    p: {d: j*9*len(ids) + i*9 for j, d in enumerate(pd.to_datetime(dates))}
    for i, p in enumerate(ids)
}

Then I can access a block by slicing, which avoids scanning the whole dataframe and improves efficiency significantly:

s = starting['id1'][parser.parse('20150101')]
%timeit df.loc[s:s+8]

This runs in about 200 µs, roughly 600 times faster than the boolean mask above.

However, this is far from the best one can do. If I try to bypass pandas as much as possible, I can split the dataframe into ~165k tiny dataframes, put them in a dictionary, and access them easily:

# reuse ids, dates and starting from above
dfs = {
    p: {d: df.loc[starting[p][d]:starting[p][d]+8, :] for d in pd.to_datetime(dates)}
    for p in ids
}

This doubles (at least) the memory used, but

d = parser.parse('20150101')
%timeit dfs['id1'][d]

gives a much nicer result of around 1 µs, another ~200x speed-up over the above, and likely as fast as we can go (already %timeit starting['id1'][d] takes around 1 µs).

However, this feels bad, and not only because of the additional memory use: it requires me to create a lot of dataframes by hand, which behind the scenes calls malloc ~165k times. That is a huge waste, since we know in advance how much memory we need (indeed we already have it allocated!). Still, it's the best option in my case, as the pre-computation time is dwarfed by the number of times I'll need to access these sub-dataframes, and it could be generalized to uneven divisions in less ideal situations where e.g. the number of tenors depends on the id or the date. Even so, this looks like something pandas should be able to do for me more efficiently than I can.
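
The closest I have come to avoiding those extra allocations is to drop down to NumPy after the sort and hand out views into a single array (a sketch under the same assumptions as above, i.e. the sorted dataframe with a reset 0..len(df)-1 index; the block helper name is mine, and the sub-blocks lose their column labels):

import numpy as np

# One array holding the three value columns of the sorted dataframe.
# This is itself (most likely) one copy of the data, but slicing it below
# returns zero-copy views rather than freshly allocated dataframes.
values = df[['v1', 'v2', 'v3']].values

def block(id_, date):
    # 9x3 view for one (id, date) pair, located via the precomputed offsets
    s = starting[id_][date]
    return values[s:s + 9]

A lookup is then just a dictionary access plus an array slice, with a single large allocation up front instead of ~165k small ones; the obvious price is losing the pandas interface on the sub-blocks.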

To sum up, the questions would be:

  1. Is there a pandas function that splits a dataframe into a dictionary whose keys are the values of some column and whose values are slices of the original dataframe? (The closest thing I am aware of is sketched right after these questions.)

  2. Is there a reason why cutting the dataframe up beforehand is ~200 times faster than accessing it directly through the precomputed indices? Since the indices I'm using are just integers from 0 to len(df), I'd expect it to be as fast as accessing any Python data structure, yet plain dictionaries win hands-down.

  3. Otherwise, and more generally, is there a faster way to access large dataframes that escapes me, in the case where we already know which rows we need?
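
For reference on question 1, the closest built-ins I am aware of are groupby-based, along the lines of the sketch below; I do not know whether either form avoids the per-group allocation cost described above, and the exact type of the Date key may depend on the pandas version:

import pandas as pd

# Materialize every sub-dataframe, keyed by (id, Date) -- convenient, but this
# still allocates one small dataframe per group, like the manual dictionary above
dfs_gb = dict(tuple(df.groupby(['id', 'Date'])))
sub = dfs_gb[('id1', pd.Timestamp('2015-01-01'))]

# Or keep only the integer positions of each group, which is much lighter
positions = df.groupby(['id', 'Date']).indices
sub2 = df.iloc[positions[('id1', pd.Timestamp('2015-01-01'))]]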

  • Do you *need* to use pandas throughout this entire process? Splitting could be faster if you leave pandas till the end when you actually need to use utilities from pandas. – Jon Apr 24 '17 at 16:15
  • For your second question, reading in your data as straight strings (i.e. text) and putting it in dictionaries will be more efficient because there is less code to manage in the background. Dataframes are bulky things. They are useful for working with *immediately needed* data, but they chew up resources if you have a large amount of data. Therefore, you should really only use the data you currently need, not everything you have. pandas Dataframes shouldn't be used as a database. – Jon Apr 24 '17 at 16:26
  • Have you tried indexing your DataFrame by the fields you want to access? Something like `df.set_index(['id', 'Date'])` and `df.loc[('id1', '2015-01-01')]`? You should get better performance by properly using an index. – root Apr 24 '17 at 16:49
  • @root thanks for the hint, I actually thought about it but didn't implement it because I assumed it couldn't be faster than setting the index to integers, precomputing the positions and accessing them (i.e. the second solution). I have now quickly tested it (roughly the sketch below) and it is indeed around 100x slower (albeit cleaner): it takes around 16 ms. – Marco Spinaci Apr 25 '17 at 08:06
  • @Jon thanks a lot for the useful comment! Indeed I misinterpreted the rationale here: I thought dataframes were somehow more optimized than the trivial code I could write, but apparently I was wrong. I don't need dataframes at all past the initial reading step, so I'll just convert them to more basic structures. I had just hoped there was a built-in function to do the splitting efficiently (I can imagine how to do that efficiently in C, but creating many structures by hand in Python is doomed to be slower). Thanks anyway for the clarification! – Marco Spinaci Apr 25 '17 at 08:15
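
For reference, the set_index variant from the comments that I timed at roughly 16 ms was along these lines (a sketch, not the exact code I ran):

# MultiIndex on (id, Date), sorted so .loc lookups can use binary search
df_ix = df.set_index(['id', 'Date']).sort_index()
%timeit df_ix.loc[('id1', pd.Timestamp('2015-01-01'))]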
