0

What is a the more efficient way to bin the amount column into different bucket and get the length of each bucket.

buckets are amount

1. amount < 10000
2. amount >=10000 & <100000
3. amount >100000 & <500000
4. amount > 500000

I was trying to implement the above question using:

sample_data = 
       date    amount type
0   2018-09-28  4000.0  D
1   2018-11-23  2000.0  D
2   2018-12-27  52.5    D
3   2018-10-02  20000.0 D
4   2018-11-27  4000.0  C
5   2018-06-01  500.0   D
6   2018-07-02  5000.0  D
7   2018-07-02  52.5    D
8   2018-10-31  500.0   D
9   2018-11-26  2000.0  C

sample_data['Transactions_bin_1'] = sample_data[sample_data.amount < 10000]['amount']
sample_data['Transactions_bin_2'] = sample_data[(sample_data['amount'] >= 10000) & (sample_data['amount'] < 100000)]['amount']
sample_data['Transactions_bin_3'] = sample_data[(sample_data['amount'] >= 100000) & (sample_data['amount'] < 500000)]['amount']
sample_data['Transactions_bin_4'] = sample_data[sample_data.amount > 500000]['amount']

bin_classification =

{
    
    'bin1' : sample_data.Transactions_bin_1.count(),
    'bin2' :  sample_data.Transactions_bin_2.count(),
    'bin3' : sample_data.Transactions_bin_3.count(),
    'bin4' :  sample_data.Transactions_bin_4.count()
}
chuky pedro
  • 756
  • 1
  • 8
  • 26

2 Answers2

4

use pd.cut: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html:

import numpy as np
bins = [-np.inf,10000,100000,500000,np.inf]
labels = ['amount < 10000' ,'amount >=10000 & <100000','amount >=100000 & <500000', 'amount >= 500000']
df['bins'] = pd.cut(df.amount, bins=bins, , labels=labels,  right=False, include_lowest=True)

OUTPUT:

         date   amount type                      bins
0  2018-09-28   4000.0    D            amount < 10000
1  2018-11-23   2000.0    D            amount < 10000
2  2018-12-27     52.5    D            amount < 10000
3  2018-10-02  20000.0    D  amount >=10000 & <100000
4  2018-11-27   4000.0    C            amount < 10000
5  2018-06-01    500.0    D            amount < 10000
6  2018-07-02   5000.0    D            amount < 10000
7  2018-07-02     52.5    D            amount < 10000
8  2018-10-31    500.0    D            amount < 10000
9  2018-11-26   2000.0    C            amount < 10000

NOTE: you can manipulate the labels list if required.

Nk03
  • 14,699
  • 2
  • 8
  • 22
1

Use pd.cut and pd.IntervalIndex.from_breaks:

>>> pd.cut(df["amount"], closed="left",
           bins=pd.IntervalIndex.from_breaks([0, 10000, 100000, 500000,  np.inf], \
      .value_counts()

[0.0, 10000.0)          9
[10000.0, 100000.0)     1
[100000.0, 500000.0)    0
[500000.0, inf)         0
Name: amount, dtype: int64
Corralien
  • 109,409
  • 8
  • 28
  • 52