0

I have a dataframe where I want to see the zero counts. It's essentially a duplicate of this question: Pandas Groupby How to Show Zero Counts in DataFrame

But unfortunately the answer is not a duplicate one. Whenever I try the MultiIndex.from_product approach, I get the error:

ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.

which is because I have several unique values for the groupby. I've confirmed though that the same script works for much smaller dataframes with fewer unique indices (and therefore, fewer elements in df.index.levels[i].values).

Here's an idea on the dataframe that I'm working with:

user1   user2   hour
-------------------
Alice   Bob     0
Alice   Carol   1
Alice   Bob     13
Bob     Eve     2

to

user1   user2   hour   count
-------------------------------
Alice   Bob     0        1
Alice   Bob     1        0
Alice   Bob     2        0

and so on but what I get is

user1   user2   hour   count
-------------------------------
Alice   Bob     0        1
Alice   Bob     13       1
Alice   Carol   1        1

However, I have ~1.2M unique combinations of user1-user2, so MultiIndex.from_product doesn't work.

EDIT: Here's the code I used for some dummy dataframe. It works for the dummy case, but not for the larger case:

import pandas as pd

df = pd.DataFrame({'id':[1,1,2,2,3,3],'hour':[0,1,0,0,1,1], 'to_count': [20,10,5,4,17,6]})
print(df)

agg_df = df.groupby(['id', 'hour']).agg({'to_count': 'count'})
print(df.groupby(['id', 'hour']).agg({'to_count':'count'}))

print(len(agg_df.index.levels))
levels = [agg_df.index.levels[i].values for i in range(len(agg_df.index.levels))]
levels[-1] = [0,1,2]
print(len(levels))
print(agg_df.index.names)
new_index = pd.MultiIndex.from_product(levels, names=agg_df.index.names)
# Reindex the agg_df and fill empty values with zero (NaN by default)
agg_df = agg_df.reindex(new_index, fill_value=0)
# Reset index
agg_df = agg_df.reset_index()

Is there a better way to show zero counts for groupby in large pandas dataframes?

RCA
  • 508
  • 4
  • 12
irene
  • 2,085
  • 1
  • 22
  • 36
  • Can you post some of your code? `MultiIndex.from_product` is a way of creating a dataframe, not grouping. – RCA May 24 '18 at 16:32
  • Hi, @RCA that's what I did, but I get the `ValueError` due to this line `idx = pd.MultiIndex.from_product([g.index.levels[0], ['Y', 'N']])` (copied from the other stackoverflow question). – irene May 24 '18 at 16:32
  • If you could post a little more of what you've done, that'd help in diagnosing the problem. – RCA May 24 '18 at 16:33
  • Hi @RCA I posted the code already for some dummy example. The dataframe is a dummy one, but the code is the exact one I used for the large dataframe. Thanks! – irene May 24 '18 at 16:34
  • Have you looked at [this question](https://stackoverflow.com/questions/21666976/valueerror-array-is-too-big-cannot-understand-how-to-fix-this)? I think that might point you in the right direction. – RCA May 24 '18 at 16:42
  • Hmm so it's a memory issue...I see. Thanks for the link. I'll see what I can do about this. – irene May 24 '18 at 16:47

0 Answers0