
Pandas "Group By" Query on Large Data in HDFStore?

I have tried the example in the linked answer, except that I would like to be able to group by two columns.

Basically, I modified the code to look like:

with pd.get_store(fname) as store:
    store.append('df',df,data_columns=['A','B','C'])
    print "store:\n%s" % store

    print "\ndf:\n%s" % store['df']

    # get the groups
    groups = store.select_column('df',['A', 'B']).unique()
    print "\ngroups:%s" % groups

I have tried multiple ways of selecting columns A and B together but could not get it to work.

The error thrown is: KeyError: "column [['A', 'B']] not found in the table"

Is this supported?

Thanks


store.select_column(...) only selects a SINGLE column.

Slightly modifying the original code that was linked:

import numpy as np
import pandas as pd
import os

fname = 'groupby.h5'

# create a frame
df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'foo',
                         'bar', 'bar', 'bar', 'bar',
                         'foo', 'foo', 'foo'],
                   'B': [1,1,1,2,
                         1,1,1,2,
                         2,2,1],
                   'C': ['dull', 'dull', 'shiny', 'dull',
                         'dull', 'shiny', 'shiny', 'dull',
                         'shiny', 'shiny', 'shiny'],
                   'D': np.random.randn(11),
                   'E': np.random.randn(11),
                   'F': np.random.randn(11)})


# create the store and append, using data_columns on the columns
# I could possibly aggregate on
with pd.get_store(fname,mode='w') as store:
    store.append('df',df,data_columns=['A','B','C'])

    print "\ndf:\n%s" % store['df']

    # get the groups
    A = store.select_column('df','A')
    B = store.select_column('df','B')
    idx = pd.MultiIndex.from_arrays([A,B])
    groups = idx.unique()

    # iterate over the groups and apply my operations
    l = []
    for (a,b) in groups:

        grp = store.select('df',where = [ 'A=%s and B=%s' % (a,b) ])

        # this is a regular frame, aggregate however you would like
        l.append(grp[['D','E','F']].sum())

print "\nresult:\n%s" % pd.concat(l, keys = groups)

os.remove(fname)

Here are the results.

The starting frame (differs from the original example in that column B now holds integers, for clarity):

df:
      A  B      C         D         E         F
0   foo  1   dull  0.993672 -0.889936  0.300826
1   foo  1   dull -0.708760 -1.121964 -1.339494
2   foo  1  shiny -0.606585 -0.345783  0.734747
3   foo  2   dull -0.818121 -0.187682 -0.258820
4   bar  1   dull -0.612097 -0.588711  1.417523
5   bar  1  shiny -0.591513  0.661931  0.337610
6   bar  1  shiny -0.974495  0.347694 -1.100550
7   bar  2   dull  1.888711  1.824509 -0.635721
8   foo  2  shiny  0.715446 -0.540150  0.789633
9   foo  2  shiny -0.262954  0.957464 -0.042694
10  foo  1  shiny  0.193822 -0.241079 -0.478291

The unique groups. We select each grouping column independently, combine the resulting columns into a multi-index, and take its unique values; these are the groups.

groups:[('foo', 1) ('foo', 2) ('bar', 1) ('bar', 2)]
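The multi-index trick works on any pair of equal-length columns; a minimal standalone sketch with toy data (no HDF5 store needed), assuming a reasonably recent pandas:

```python
import pandas as pd

# the two grouping columns, as they would come back from two
# single-column select_column calls
A = pd.Series(['foo', 'foo', 'bar', 'bar', 'foo'])
B = pd.Series([1, 2, 1, 1, 1])

# combine them into one index and take the unique (A, B) pairs
idx = pd.MultiIndex.from_arrays([A, B])
groups = idx.unique()
print(list(groups))  # [('foo', 1), ('foo', 2), ('bar', 1)]
```

Each element of `groups` is an (A, B) tuple, which is why the loop above can unpack it as `for (a,b) in groups`.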

The final result.

result:
foo  1  D   -0.127852
        E   -2.598762
        F   -0.782213
     2  D   -0.365629
        E    0.229632
        F    0.488119
bar  1  D   -2.178105
        E    0.420914
        F    0.654583
     2  D    1.888711
        E    1.824509
        F   -0.635721
dtype: float64
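As a sanity check: assuming the frame fits in memory, the group-by-group loop above should give the same sums as a plain in-memory groupby (toy values here, not the random data from the example):

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar'],
                   'B': [1, 1, 1, 2],
                   'D': [0.5, 1.5, -1.0, 2.0],
                   'E': [1.0, -1.0, 0.0, 3.0]})

# group on both columns at once and sum the value columns
result = df.groupby(['A', 'B'])[['D', 'E']].sum()
print(result)
```

The point of the HDFStore approach is that only one group's rows are ever loaded at a time, so it scales to frames too large for a single in-memory groupby.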
Jeff