I have a dataframe where I want to see the zero counts. It's essentially a duplicate of this question: Pandas Groupby How to Show Zero Counts in DataFrame
But unfortunately the answer is not a duplicate one. Whenever I try the MultiIndex.from_product
approach, I get the error:
ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.
which is because I have several unique values for the groupby. I've confirmed though that the same script works for much smaller dataframes with fewer unique indices (and therefore, fewer elements in df.index.levels[i].values
).
Here's an idea on the dataframe that I'm working with:
user1 user2 hour
-------------------
Alice Bob 0
Alice Carol 1
Alice Bob 13
Bob Eve 2
to
user1 user2 hour count
-------------------------------
Alice Bob 0 1
Alice Bob 1 0
Alice Bob 2 0
and so on but what I get is
user1 user2 hour count
-------------------------------
Alice Bob 0 1
Alice Bob 13 1
Alice Carol 1 1
However, I have ~1.2M unique combinations of user1-user2
, so MultiIndex.from_product
doesn't work.
EDIT: Here's the code I used for some dummy dataframe. It works for the dummy case, but not for the larger case:
import pandas as pd
df = pd.DataFrame({'id':[1,1,2,2,3,3],'hour':[0,1,0,0,1,1], 'to_count': [20,10,5,4,17,6]})
print(df)
agg_df = df.groupby(['id', 'hour']).agg({'to_count': 'count'})
print(df.groupby(['id', 'hour']).agg({'to_count':'count'}))
print(len(agg_df.index.levels))
levels = [agg_df.index.levels[i].values for i in range(len(agg_df.index.levels))]
levels[-1] = [0,1,2]
print(len(levels))
print(agg_df.index.names)
new_index = pd.MultiIndex.from_product(levels, names=agg_df.index.names)
# Reindex the agg_df and fill empty values with zero (NaN by default)
agg_df = agg_df.reindex(new_index, fill_value=0)
# Reset index
agg_df = agg_df.reset_index()
Is there a better way to show zero counts for groupby in large pandas dataframes?