
I've a question about rebinning a list of numbers, with a desired bin-width. It's basically what a frequency histogram does, but I don't want the plot, just the bin number and the number of occurrences for each bin.

I've already written some code that does what I want, but it's not very efficient. Given a list a, to rebin it with a bin width of 3, I've written the following:

import numpy as np

# list of numbers
a = list(range(3000))

# number of entries
L = int(len(a))

# desired bin width
W = 3

# number of bins with width W
N = int(L/W)

# definition of new empty array
a_rebin = np.zeros((N, 2))

# cycles to populate the new rebinned array
for n in range(0,N):
    k = 0
    for i in range(0,L):
        if a[i] >= (W*n) and a[i] < (W+W*n):
            k = k+1
    a_rebin[n]=[W*n,k]

# print
print(a_rebin)

Now, this does exactly what I want, but I don't think it's very smart, as it reads the whole list N times, where N is the number of bins. That's fine for small lists, but I have to deal with very large lists and rather small bin widths, which means huge values of N, and the whole process takes a very long time (hours...). Do you have any ideas to improve this code? Thank you in advance!

urgeo

2 Answers


If you use a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], your solution is:

[[ 0. 3.]
[ 3. 3.]
[ 6. 3.]]

How do you interpret this? Are the intervals 0..2, 3..5, 6..8? The value 9 is not counted anywhere, so I think you are missing something.

Using numpy.histogram()

import numpy
a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
W = 3  # desired bin width
hist, bin_edges = numpy.histogram(a, bins=len(a) // W)
print(hist)
print(bin_edges)

Output:

[3 3 4]
[ 0. 3. 6. 9.]

bin_edges holds 4 values: 0, 3, 6 and 9. Every bin except the last (rightmost) one is half-open, so the 3 intervals are [0,3), [3,6) and [6,9], and they contain 3, 3 and 4 elements respectively.
You can also define your own bin edges.

import numpy
a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
bins=[0,1,2]
hist, bin_edges = numpy.histogram(a, bins=bins)
print(hist)
print(bin_edges)

Output:

[1 2]
[0 1 2]

Now you have 1 element in [0,1) and 2 elements in [1,2].
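To get bins of a fixed width rather than a fixed number of bins, one option is to pass explicit edges at multiples of the width. The following is only a sketch: the helper name fixed_width_histogram is made up for illustration, and it assumes integer data and an integer width (with floats, numpy.arange can produce slightly inexact edges). The edges are extended one step past the maximum so that the largest value gets its own bin instead of being merged into the previous one.

```python
import numpy as np

def fixed_width_histogram(values, width):
    """Count values in bins [k*width, (k+1)*width). Hypothetical helper name."""
    values = np.asarray(values)
    # First edge: largest multiple of width that is <= min(values)
    start = (values.min() // width) * width
    # Last edge: smallest multiple of width that is strictly > max(values)
    stop = (values.max() // width + 1) * width
    # np.arange excludes its end point, so go one step further to include stop
    edges = np.arange(start, stop + width, width)
    counts, edges = np.histogram(values, bins=edges)
    return counts, edges

a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
counts, edges = fixed_width_histogram(a, 3)
print(counts)
print(edges)
```

For this list and a width of 3, the counts should come out as 3, 3, 3 and 1 with edges 0, 3, 6, 9 and 12; with a width of 5 the counts should be 5 and 5 with edges 0, 5 and 10.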

Jose Raul Barreras
  • Mmm yes, my algorithm misses the last bin, but your solution merges the last two bins together, or so it seems. For [0,1,2,3,4,5,6,7,8,9] with a bin width of 3, I expect occurrences of [3,3,3,1], but you get [3,3,4]. If I choose a bin width of 5 I expect occurrences of [5,5], but this code gives me bin edges that I don't understand, [ 0. 4.5 9. ]... Sorry, I'm not very used to Python... – urgeo Aug 05 '16 at 10:52
  • We have 4 values in bin_edges: 0, 3, 6 and 9. All but the last (righthand-most) bin is half-open. It means we have 3 intervals [0,3), [3,6) and [6,9] and we have 3, 3 and 4 elements in each bin. You can define your own bins: [0,1,2] and now you have 1 element in [0 ,1) and 2 elements in [1,2]. OK now? – Jose Raul Barreras Aug 05 '16 at 14:45

NumPy has a function called np.histogram that does this work for you. It processes the data with vectorized operations rather than nested Python loops, so it also scales well to large lists.
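As a rough illustration of a single-pass approach, here is a sketch that produces the same two-column layout as the question's a_rebin (bin start, count). Note that it uses np.bincount on integer bin indices rather than np.histogram; the function name rebin is made up for this example, and it assumes all values are non-negative.

```python
import numpy as np

def rebin(values, width):
    """Return rows of [bin_start, count]. Hypothetical helper; needs values >= 0."""
    values = np.asarray(values)
    # Bin index of each value: floor(value / width)
    idx = (values // width).astype(int)
    # Count how many values fall on each index in one pass over the data
    counts = np.bincount(idx)
    # Left edge of each bin
    starts = np.arange(len(counts)) * width
    return np.column_stack((starts, counts))

a = list(range(3000))
a_rebin = rebin(a, 3)
print(a_rebin)
```

Unlike the loop in the question, this reads the list once regardless of how many bins there are, and it keeps the final, partially filled bin.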

Naveen Arun