I have about 7 million rows in an HDFStore with more than 60 columns. The data is more than I can fit into memory. I'm looking to aggregate the data into groups based on the value of a column "A". The documentation for pandas splitting/aggregating/combining assumes that I have all my data in a DataFrame already; however, I can't read the entire store into an in-memory DataFrame. What is the correct approach for grouping data in an HDFStore?

- Have you looked at http://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas/14287518#14287518? Jeff's answers are a good primer for this type of workflow – Chang She Apr 03 '13 at 21:29
- This is currently being discussed as a future enhancement for pandas. I sure hope it is added one day as I could finally ditch SAS: https://github.com/pydata/pandas/issues/3202 – Zelazny7 Apr 03 '13 at 21:29
- As I understand the documentation, HDFStores don't support all operations available on DataFrames. You could try using the table querying abilities described [here](http://pandas.pydata.org/pandas-docs/dev/io.html#query-via-data-columns) to manually select one group at a time. – BrenBarn Apr 03 '13 at 21:29
- can u give a small example of what kind of functions u r going to apply with the group? and a small example frame would be helpful. – Jeff Apr 03 '13 at 21:43
- also pls show df.get_dtype_counts() and whether u use data_columns. thxs – Jeff Apr 03 '13 at 21:44
- one more item would be helpful: what kind of group density are you expecting? e.g. 100 groups, 10,000, or 1M (or more)? is the groupby over multiple columns, what are the types of those columns? – Jeff Apr 03 '13 at 21:57
1 Answer
Here's a complete example.
import numpy as np
import pandas as pd
import os

fname = 'groupby.h5'

# create a frame
df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'foo',
                         'bar', 'bar', 'bar', 'bar',
                         'foo', 'foo', 'foo'],
                   'B': ['one', 'one', 'one', 'two',
                         'one', 'one', 'one', 'two',
                         'two', 'two', 'one'],
                   'C': ['dull', 'dull', 'shiny', 'dull',
                         'dull', 'shiny', 'shiny', 'dull',
                         'shiny', 'shiny', 'shiny'],
                   'D': np.random.randn(11),
                   'E': np.random.randn(11),
                   'F': np.random.randn(11)})

# create the store and append, using data_columns where I possibly
# could aggregate
with pd.get_store(fname) as store:
    store.append('df', df, data_columns=['A', 'B', 'C'])

    print "store:\n%s" % store
    print "\ndf:\n%s" % store['df']

    # get the groups
    groups = store.select_column('df', 'A').unique()
    print "\ngroups:%s" % groups

    # iterate over the groups and apply my operations
    l = []
    for g in groups:
        grp = store.select('df', where=['A=%s' % g])

        # this is a regular frame, aggregate however you would like
        l.append(grp[['D', 'E', 'F']].sum())

    print "\nresult:\n%s" % pd.concat(l, keys=groups)

os.remove(fname)
Output
store:
<class 'pandas.io.pytables.HDFStore'>
File path: groupby.h5
/df frame_table (typ->appendable,nrows->11,ncols->6,indexers->[index],dc->[A,B,C])
df:
A B C D E F
0 foo one dull -0.815212 -1.195488 -1.346980
1 foo one dull -1.111686 -1.814385 -0.974327
2 foo one shiny -1.069152 -1.926265 0.360318
3 foo two dull -0.472180 0.698369 -1.007010
4 bar one dull 1.329867 0.709621 1.877898
5 bar one shiny -0.962906 0.489594 -0.663068
6 bar one shiny -0.657922 -0.377705 0.065790
7 bar two dull -0.172245 1.694245 1.374189
8 foo two shiny -0.780877 -2.334895 -2.747404
9 foo two shiny -0.257413 0.577804 -0.159316
10 foo one shiny 0.737597 1.979373 -0.236070
groups:Index([bar, foo], dtype=object)
result:
bar D -0.463206
E 2.515754
F 2.654810
foo D -3.768923
E -4.015488
F -6.110789
dtype: float64
Some caveats:
1) This methodology makes sense if your group density is relatively low, on the order of hundreds or thousands of groups. If you have more than that, there are more efficient (but more complicated) methods, and the function you are applying (in this case sum) becomes more restrictive.
Essentially you would iterate over the entire store in chunks, grouping as you go, but keeping the groups only semi-collapsed (imagine doing a mean: you would need to keep a running total plus a running count, then divide at the end). So some operations would be a bit trickier, but this can potentially handle MANY groups (and is really fast); see the first sketch after these caveats.
2) The efficiency of this could be improved by saving the coordinates (i.e. the row locations) of each group, though this is a bit more complicated; see the second sketch below.
3) Multi-grouping is not possible with this scheme as written (it IS possible, but requires an approach more like 2) above).
4) Any column that you want to group on MUST be a data_column!
5) You can combine any other filter you wish in the select (which, by the way, is a sneaky way of doing multi-grouping: you just form 2 unique lists of group values and iterate over their product; not extremely efficient if you have lots of groups, but it can work); see the third sketch below.
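As a rough illustration of the chunked, semi-collapsed approach in caveat 1 (this sketch is not part of the original answer), you can iterate the table with a chunksize, keep a running sum and a running count per group, and only divide at the end to get a mean. It assumes a store built as in the example above (before the os.remove call); the chunksize of 50,000 is arbitrary.
# sketch: chunked groupby, keeping groups semi-collapsed (running sum + count)
sums, counts = None, None
with pd.HDFStore(fname) as store:
    # select returns an iterator of DataFrames when chunksize is given
    for chunk in store.select('df', chunksize=50000):
        g = chunk.groupby('A')[['D', 'E', 'F']]
        s, c = g.sum(), g.count()
        # accumulate across chunks; fill_value=0 handles groups absent from a chunk
        sums = s if sums is None else sums.add(s, fill_value=0)
        counts = c if counts is None else counts.add(c, fill_value=0)
# collapse at the end: mean = running total / running count
means = sums / counts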
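For caveat 2, here is a sketch of the coordinate-saving idea (again illustrative, not from the answer itself): HDFStore.select_as_coordinates returns the row locations matching a query, so you can compute them once per group and reuse them on later passes instead of re-running the string query each time. This sketch uses the newer query-string syntax (e.g. A == 'foo') and the same store as above.
# sketch: save each group's row coordinates once, then select by coordinates
with pd.HDFStore(fname) as store:
    groups = store.select_column('df', 'A').unique()
    coords = {g: store.select_as_coordinates('df', 'A == %r' % str(g))
              for g in groups}
    # later passes can reuse the saved coordinates instead of re-querying
    results = {g: store.select('df', where=c)[['D', 'E', 'F']].sum()
               for g, c in coords.items()}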
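And for caveat 5, a sketch of the product-of-filters trick for pseudo multi-grouping on two data_columns (illustrative only; both A and B are data_columns in the example above):
# sketch: pseudo multi-grouping by combining filters on two data_columns
from itertools import product

l, keys = [], []
with pd.HDFStore(fname) as store:
    a_vals = store.select_column('df', 'A').unique()
    b_vals = store.select_column('df', 'B').unique()
    for a, b in product(a_vals, b_vals):
        grp = store.select('df', where='A == %r & B == %r' % (str(a), str(b)))
        if len(grp):  # some (a, b) combinations may be empty
            keys.append((a, b))
            l.append(grp[['D', 'E', 'F']].sum())
result = pd.concat(l, keys=keys)  # indexed by (A, B)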
HTH
let me know if this works for you

- Thanks Jeff. I'd give you bonus points if I could for adding this example (and a link back to SO!) to the git ticket. – technomalogical Apr 04 '13 at 17:54
- you also made the cookbook (but not updated with this question yet though), see: http://pandas.pydata.org/pandas-docs/dev/cookbook.html#hdfstore – Jeff Apr 04 '13 at 19:06
- In later versions of pandas, the line `groups = store.unique('df','A')` should read `groups = store.select_column('df', 'A').unique()`. – IanH Jul 03 '14 at 00:41
- But `store.select_column('df','A').unique()` will trigger a full read, what if `df['A']` is too big to fit into memory? – agemO Sep 24 '20 at 12:59