
I have to generate a list of random numbers with a Gaussian distribution (I'm able to do this) and then plot those numbers in a histogram. My problem is that I'm supposed to do this without using the built-in histogram function in pylab (or in any other package, for that matter), and I'm at a complete loss. I've been looking online and haven't found anything that explains how to go about this. Does anyone know what I could do? Thanks in advance.

Dora
user3282005
  • How about bucketing your values and then printing out the count of items in each bucket with asterisks? Like: `for bucket in buckets: print '*' * bucket`. This will give you a graphical representation with the bars displayed horizontally. – hughdbrown Feb 07 '14 at 05:20
  • several answers in http://stackoverflow.com/questions/2870466/python-histogram-one-liner don't use any histogram modules. – Bonlenfum Feb 07 '14 at 10:16
  • @hughdbrown the question appears to be specifically about how to bucket the values and determine the counts. – Karl Knechtel Aug 01 '22 at 21:18
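
The bucketing-plus-asterisks idea from the comments can be sketched in pure Python. This is only an illustration: the helper names `bucket_counts` and `text_histogram`, the seed, the sample size and the bar width are all assumptions made for the example, not anything from the question.

```python
import random

def bucket_counts(values, nbins):
    """Count how many values fall into each of nbins equal-width buckets."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / nbins  # assumes hi > lo
    counts = [0] * nbins
    for v in values:
        # Clamp so that the maximum value goes into the last bucket.
        index = min(int((v - lo) / width), nbins - 1)
        counts[index] += 1
    return counts

def text_histogram(counts, max_width=50):
    """Return one string of asterisks per bucket, scaled to max_width."""
    peak = max(counts)
    return ["*" * (c * max_width // peak) for c in counts]

random.seed(0)  # fixed seed only so the sketch is repeatable
samples = [random.gauss(0.0, 1.0) for _ in range(1000)]
counts = bucket_counts(samples, 10)
for line in text_histogram(counts):
    print(line)
```

Each printed line is one bucket, so the bars run horizontally and the bell shape appears rotated by 90 degrees.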

3 Answers


A fast way to compute a histogram is to walk through the list one element at a time, work out which bin each element belongs in, and increment the count for that bin.

hist_vals = np.zeros(nbins)
for d in data:
    # clamp so that d == max_val lands in the last bin instead of index nbins
    bin_number = min(int(nbins * ((d - min_val) / (max_val - min_val))), nbins - 1)
    hist_vals[bin_number] += 1

Note that this runs in O(len(data)) time with a small prefactor.
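
For reference, here is a self-contained sketch of the single-pass loop. The function name `loop_histogram`, the fixed range of -4 to 4, the seed and the choice to skip out-of-range values are assumptions made for this example only.

```python
import numpy as np

def loop_histogram(data, nbins, min_val, max_val):
    """Single-pass histogram: one bin lookup and one increment per element."""
    hist_vals = np.zeros(nbins, dtype=int)
    scale = nbins / (max_val - min_val)
    for d in data:
        if d < min_val or d > max_val:
            continue  # ignore values outside the requested range
        # Clamp so that d == max_val falls in the last bin.
        bin_number = min(int((d - min_val) * scale), nbins - 1)
        hist_vals[bin_number] += 1
    return hist_vals

rng = np.random.default_rng(0)  # fixed seed only for repeatability
data = rng.normal(loc=0.0, scale=1.0, size=10_000)
hist_vals = loop_histogram(data, nbins=20, min_val=-4.0, max_val=4.0)
```

Every element is touched exactly once, which is where the linear running time comes from.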

A smarter way to write this is to vectorize the hash function:

bin_number = (nbins * ((data - min_val) / (max_val - min_val))).astype(int)
bin_number = np.minimum(bin_number, nbins - 1)  # data == max_val goes in the last bin

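and use an unbuffered in-place add for the summation (a plain `hist_vals[bin_number] += 1` would count each repeated bin index only once):

np.add.at(hist_vals, bin_number, 1)  # accumulates every occurrence of each index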
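
One caveat with index-array increments is worth seeing on a tiny input: when the index array contains repeats, NumPy's `a[idx] += 1` applies each distinct index only once, whereas `np.add.at` and `np.bincount` accumulate every occurrence. The array names below are made up for this sketch.

```python
import numpy as np

bin_number = np.array([0, 1, 1, 2, 2, 2])

buffered = np.zeros(3, dtype=int)
buffered[bin_number] += 1             # repeated indices are incremented only once

unbuffered = np.zeros(3, dtype=int)
np.add.at(unbuffered, bin_number, 1)  # every occurrence is accumulated

counted = np.bincount(bin_number, minlength=3)  # same counts in one call

print(buffered)    # [1 1 1]
print(unbuffered)  # [1 2 3]
print(counted)     # [1 2 3]
```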

If you are concerned about speed, you can use the NumPy functions that do essentially the same thing but run the loops at the C level:

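bins = np.linspace(min_val, max_val, nbins + 1)  # nbins + 1 bin edges
bin_nums = np.digitize(data, bins) - 1  # a value equal to max_val lands in index nbins
hist_vals = np.bincount(bin_nums)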
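
A complete version of the `np.digitize` / `np.bincount` approach might look like the sketch below. Digitizing against the interior edges only (`edges[1:-1]`) is a variation chosen for this example so that the indices run from 0 to `nbins - 1` without a separate subtraction or clamp; the names `edges` and `rng`, the seed and the sample size are likewise assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed only for repeatability
data = rng.normal(size=100_000)

nbins = 50
min_val, max_val = data.min(), data.max()
edges = np.linspace(min_val, max_val, nbins + 1)  # nbins + 1 bin edges

# Digitizing against the interior edges yields indices 0 .. nbins - 1,
# with data == max_val landing in the last bin.
bin_nums = np.digitize(data, edges[1:-1])
hist_vals = np.bincount(bin_nums, minlength=nbins)
```

Passing `minlength=nbins` keeps the result at a fixed length even when the trailing bins happen to be empty.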
tacaswell
  • This might work if your dataset is small. If you're dealing with a large sample (N>1e6), the solution I proposed scales much better. – Brian Feb 10 '14 at 10:36
  • @Matteo Have you actually benchmarked this? This should scale better, but might be slower due to doing the loop in Python instead of C (the wrong algorithm in C can be faster than the right one in Python). – tacaswell Feb 10 '14 at 14:09
  • I'm used to dealing with large amounts of data (N>1e8) and I can tell you that if I use the code I suggested I only have to wait for a couple of minutes (at most) to get the answer. Your approach, although certainly smart and useful for small datasets, or when the number of bins is large, can take up to several days. – Brian Feb 10 '14 at 14:52
  • I just ran it on 1e8 data points and it took around a minute for the pure-python version and < 6 seconds for `np.bincount(np.digitize(np.random.rand(1e8), np.linspace(0, 1, 100, endpoint=True), right=True) - 1)` – tacaswell Feb 10 '14 at 15:29
  • @tcaswell: I wonder if `hist_vals = np.zeros(nbins)` should be changed to `hist_vals = np.zeros(nbins+1)`! – Dataman Jun 10 '16 at 09:23

Let's assume you have a numpy array that represents your random numbers (the snippets below assume `from pylab import *`):

rnd_numb = array([ 0.48942231,  0.48536864,  0.48614467, ...,  0.47264172,
                   0.48309697,  0.48439782])

In order to create a histogram you only need to bin your data. So let's create an array that defines the bin edges:

bin_array = linspace(0, 1, 101)

In this case we're creating 100 equal-width bins (101 edges) in the range 0 to 1.

Now, in order to create the histogram you can simply do:

my_histogram = []
for i in range(len(bin_array) - 1):
    # select the values in the half-open interval [bin_array[i], bin_array[i+1])
    mask = (rnd_numb >= bin_array[i]) & (rnd_numb < bin_array[i + 1])
    my_histogram.append(mask.sum())

This creates a list that contains the counts in each bin. Lastly, if you want to visualize your histogram, you can plot the counts against the bin centers:

plot((bin_array[1:] + bin_array[:-1]) / 2., my_histogram)

You can also try `step` or `bar`.

Brian
  • This will scale really badly as you touch every element of the data array _many_ times (> nbins * 2). You can compute a histogram in a single pass through the data which – tacaswell Feb 07 '14 at 15:48

Here is a version that builds on @tacaswell's solution but that doesn't use numpy.

def histogram(data, nbins, min_val=None, max_val=None):
    hist_vals = [0] * nbins
    if min_val is None:
        min_val = min(data)
    if max_val is None:
        max_val = max(data)
    span = float(max_val - min_val)  # assumes max_val > min_val

    for d in data:
        # clamp so that d == max_val falls in the last bin instead of overflowing
        bin_number = min(int(nbins * (d - min_val) / span), nbins - 1)
        hist_vals[bin_number] += 1
    # each bin is span / nbins wide, so the lower bounds step by that amount
    bin_lower_bounds = [min_val + i * span / nbins for i in range(nbins)]
    return hist_vals, bin_lower_bounds
bbrame