24

Say I have a huge list of numbers between 0 and 100. I compute ranges, depending on the max number and then saying there are 10 bins. So my ranges are for example

ranges = [0,10,20,30,40,50,60,70,80,90,100]

Now I count the occurances in each range from 0-10, 10-20, and so on. I iterate over every number in the list and check for a range. I assume this is not the best way in terms of runtime speed.

Can I fasten it up by using pandas, e.g. pandas.groupby, and how?

user2366975
  • 4,350
  • 9
  • 47
  • 87

2 Answers2

40

We can use pd.cut to bin the values into ranges, then we can groupby these ranges, and finally call count to count the values now binned into these ranges:

np.random.seed(0)
df = pd.DataFrame({"a": np.random.random_integers(1, high=100, size=100)})
ranges = [0,10,20,30,40,50,60,70,80,90,100]
df.groupby(pd.cut(df.a, ranges)).count()

            a
a            
(0, 10]    11
(10, 20]   10
(20, 30]    8
(30, 40]   13
(40, 50]   11
(50, 60]    9
(60, 70]   10
(70, 80]   11
(80, 90]   13
(90, 100]   4
cs95
  • 379,657
  • 97
  • 704
  • 746
EdChum
  • 376,765
  • 198
  • 813
  • 562
35

Surprised I haven't seen this yet, so without further ado, here is

.value_counts(bins=N)

Computing bins with pd.cut followed by a groupBy is a 2-step process. value_counts allows you a shortcut using the bins argument:

# Uses Ed Chum's setup. Cross check our answers match!
np.random.seed(0)
df = pd.DataFrame({"a": np.random.random_integers(1, high=100, size=100)})

df['a'].value_counts(bins=10, sort=False)

(0.9, 10.9]      11
(10.9, 20.8]     10
(20.8, 30.7]      8
(30.7, 40.6]     13
(40.6, 50.5]     11
(50.5, 60.4]      9
(60.4, 70.3]     10
(70.3, 80.2]     11
(80.2, 90.1]     13
(90.1, 100.0]     4
Name: a, dtype: int64

This creates 10 evenly-spaced right-closed intervals and bincounts your data. sort=False will be required to avoid value_counts ordering the result in decreasing order of count.


Binning by Unequal Ranges

For this, you can pass a list to bins argument:

bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
df['a'].value_counts(bins=bins, sort=False) 

(-0.001, 10.0]    11
(10.0, 20.0]      10
(20.0, 30.0]       8
(30.0, 40.0]      13
(40.0, 50.0]      11
(50.0, 60.0]       9
(60.0, 70.0]      10
(70.0, 80.0]      11
(80.0, 90.0]      13
(90.0, 100.0]      4
Name: a, dtype: int64
cs95
  • 379,657
  • 97
  • 704
  • 746