2

So i have a dataset about customers in a store and the sales of that store on each day.

Which looks something like this -

Store ID     Sales      Customers
1           250        500
2           276        786
3           124        256
5           164        925

How do i convert it to grouped data, something like this

Sales           Customers
0-100           0
100-200         1181
200-300         1286

I have searched for a while and found this from the pandas site - http://pandas.pydata.org/pandas-docs/version/0.15.2/groupby.html

df2.groupby(['X'], sort=True).sum()

But i am unable to understand how to apply the same to my example.

3 Answers3

1

Use pd.cut with cusom bins

In [2058]: df.groupby(pd.cut(df.Sales, [0, 100, 200, 300])).Customers.sum().fillna(0)
Out[2058]:
Sales
(0, 100]         0.0
(100, 200]    1181.0
(200, 300]    1286.0
Name: Customers, dtype: float64

And reset_index() to reshape

In [2061]: (df.groupby(pd.cut(df.Sales, [0, 100, 200, 300])).Customers.sum()
              .fillna(0).reset_index())
Out[2061]:
        Sales  Customers
0    (0, 100]        0.0
1  (100, 200]     1181.0
2  (200, 300]     1286.0

And, use labels for custom names

In [2062]: (df.groupby(pd.cut(df.Sales, 
               bins=[0, 100, 200, 300], 
               labels=['0-100', '100-200', '200-300']))
         .Customers.sum().fillna(0).reset_index())
Out[2062]:
     Sales  Customers
0    0-100        0.0
1  100-200     1181.0
2  200-300     1286.0
Zero
  • 74,117
  • 18
  • 147
  • 154
1

Use cut for bins and then groupby and aggregate sum:

df = df.groupby(pd.cut(df['Sales'], [0,100,200,300]))['Customers'].sum().fillna(0)
print (df)
Sales
(0, 100]         0.0
(100, 200]    1181.0
(200, 300]    1286.0
Name: Customers, dtype: float64

Also is possible define labels:

l =['0-100','100-200','200-300']
b = [0,100,200,300]
df = df.groupby(pd.cut(df['Sales'], bins=b, labels=l))['Customers'].sum()
       .fillna(0)    
       .reset_index()
print (df)
     Sales  Customers
0    0-100        0.0
1  100-200     1181.0
2  200-300     1286.0
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
1

with np.searchsorted

labels = {0: '0-100', 1: '100-200', 2: '200-300'}
s = np.searchsorted([100, 200], df.Sales)
df.groupby(s).Customers.sum().reindex(
    pd.Index([0, 1, 2], name='Sales'), fill_value=0
).rename(labels).reset_index()

     Sales  Customers
0    0-100          0
1  100-200       1181
2  200-300       1286
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • Can I ask something? I use your solution for timings and it seems wrong output [here](https://stackoverflow.com/a/46091980/2901002). And I have no idea what is wrong. Can you check it? – jezrael Sep 12 '17 at 07:14
  • Your functions don't accept a parameter. Therefore `data` is always the same. Use this for `stmt` parameter: `'{}(df)'.format(j)`. And place `data` in the signature of each function you define. `dropna(data)` and `notnull(data)` and `query(data)` – piRSquared Sep 12 '17 at 07:20