I have a list of integers and want to get the frequency of each integer. This was discussed here

The problem is that the approach I'm using gives me frequencies for floating-point values even though my data set consists of integers only. Why does that happen, and how can I get the frequency of each integer in my data?

I'm using pyplot.hist to plot a histogram of the frequency of occurrences

import numpy as np
import matplotlib.pyplot as plt

data = np.loadtxt('data.txt', dtype=int, usecols=(4,))  # load the 5th column (index 4) of the csv file into an array named data
plt.hist(data)  # plot the column as a histogram

I'm getting the histogram, but I've noticed that if I print the result of np.histogram(data)

hist = np.histogram(data)
print(hist)

I get this:

(array([ 2323, 16338,  1587,   212,    26,    14,     3,     2,     2,     2]), 
array([  1. ,   2.8,   4.6,   6.4,   8.2,  10. ,  11.8,  13.6,  15.4,
    17.2,  19. ]))

Here the second array represents the values and the first array represents the number of occurrences.

All values in my data set are integers, so why does the second array contain floating-point numbers, and how should I get the frequency of the integers?

UPDATE:

This solves the problem, thank you Lev for the reply.

plt.hist(data, bins=np.arange(data.min(), data.max()+1))

To avoid creating a new question: how can I plot the columns "in the middle" for each integer? Say, I want the column for integer 3 to span 2.5 to 3.5, not 3 to 4.

user40
  • Are you sure you are using the data you think you are? Your comment says the 4th column, but indexing starts at 0 so column 4 is actually the 5th column. – djhoese Mar 02 '14 at 12:59
  • Yes, it's the fifth column; that was a typo. – user40 Mar 02 '14 at 13:00
  • I guess it should be `data.max() + 2`. `np.arange` is without the upper border and `bins` contains the range (elements from 0-1, 1-2, ...) – Martin Thoma Mar 09 '17 at 16:17

3 Answers

If you don't specify which bins to use, np.histogram and pyplot.hist fall back to a default setting, which is 10 equal-width bins. The left border of the first bin is the smallest value in the data and the right border of the last bin is the largest.
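To see where the fractional borders come from, here is a minimal sketch. The sample array is made up (it is not the asker's file); it only shares the minimum of 1 and maximum of 19 seen in the question's output.

```python
import numpy as np

# Made-up integer sample whose minimum is 1 and maximum is 19.
data = np.array([1, 2, 2, 3, 3, 3, 5, 8, 19])

# No bins argument, so NumPy uses its default of 10 equal-width bins.
counts, edges = np.histogram(data)

# 10 bins need 11 borders, evenly spaced between data.min() and data.max().
expected_edges = np.linspace(data.min(), data.max(), 11)

print(edges)
print(np.allclose(edges, expected_edges))
```

The spacing between borders is (19 - 1) / 10 = 1.8, which is why non-integer borders such as 2.8 and 4.6 appear even though every data point is an integer.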

This is why the bin borders are floating-point numbers. You can use the bins keyword argument to enforce another choice of bins, e.g.:

plt.hist(data, bins=np.arange(data.min(), data.max()+2))

np.arange excludes its stop value, so data.max()+2 is needed to make data.max()+1 the last border; with data.max()+1 as the stop value, the two largest integers would share the last bin.

Edit: the easiest way to shift all bins to the left is probably just to subtract 0.5 from all bin borders:

plt.hist(data, bins=np.arange(data.min(), data.max()+2)-0.5)

Another way to achieve the same effect (not equivalent if non-integers are present):

plt.hist(data, bins=np.arange(data.min(), data.max()+2), align='left')
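The choice of stop value passed to np.arange can be checked numerically without plotting. The following is a sketch with a made-up sample (an assumption, not the asker's data) that compares a stop of data.max() + 1 with a stop of data.max() + 2, using np.histogram, which computes the same counts that plt.hist draws.

```python
import numpy as np

data = np.array([1, 1, 2, 3, 3, 3, 4])  # made-up sample

# Stop at max + 1: borders are 1, 2, 3, 4, which gives only 3 bins.
# The last bin [3, 4] is closed on the right, so 3s and 4s are counted together.
edges_short = np.arange(data.min(), data.max() + 1)
counts_short, _ = np.histogram(data, bins=edges_short)

# Stop at max + 2, shifted by 0.5: borders are 0.5, 1.5, ..., 4.5,
# which gives one bin centered on each integer.
edges = np.arange(data.min(), data.max() + 2) - 0.5
counts, _ = np.histogram(data, bins=edges)

values = np.arange(data.min(), data.max() + 1)
for value, count in zip(values, counts):
    print(value, count)
```

With the wider range, counts lines up one-to-one with the integers from data.min() to data.max(), so it can be read directly as a frequency table.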
Lev Levitsky
  • So if I know the exact number of different values I can just put the value in parentheses? And if I don't know, then what you suggested. I will try. – user40 Mar 02 '14 at 13:02
  • @user40 Yes, you can specify any sequence, but keep in mind that it's the _borders_ you supply, so there's n+1 of them for n bins. Also, there can't be "space" between bins AFAIK, although you can make it look like there is some by specifying the bin widths. Edit: I just realized you said something different. Yes, you can just specify a number, like 10. That would mean the number of equally-sized bins, from min to max. – Lev Levitsky Mar 02 '14 at 13:07
  • That worked, thank you. But on the plot, each bin spans from one integer to the next. How can I place the bins from, say, 1.5 to 2.5 for value=2, 2.5 to 3.5 for 3, etc.? I've updated my question. – user40 Mar 02 '14 at 13:16
  • @user40 Does subtracting 0.5 from all bins do what you need? See my edit above – Lev Levitsky Mar 02 '14 at 15:08
  • Would you mind looking at this please: http://stackoverflow.com/questions/22132298/python-read-database-and-plot-degree-distribution – user40 Mar 02 '14 at 22:16
  • Omg I spent 1h of my life trying to center my occurrence counts under the bars... Love you man ! :D – Anton Belev Mar 07 '15 at 21:37

You can use groupby from itertools as shown in How to count the frequency of the elements in a list?

import numpy as np
from itertools import groupby
freq = {key:len(list(group)) for key, group in groupby(np.sort(data))}
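As a self-contained sketch of the same idea, the snippet below runs the groupby approach on a made-up sample and cross-checks it against np.unique with return_counts=True. The sample values and the cross-check are illustrative additions, not part of the original answer.

```python
from itertools import groupby

import numpy as np

data = np.array([3, 1, 2, 3, 1, 3])  # made-up sample

# groupby only merges adjacent equal items, so the data must be sorted first.
freq = {int(key): len(list(group)) for key, group in groupby(np.sort(data))}

# Cross-check using NumPy's own counting of unique values.
values, counts = np.unique(data, return_counts=True)
freq_np = dict(zip(values.tolist(), counts.tolist()))

print(freq)
print(freq == freq_np)
```

Both dictionaries map each integer to its number of occurrences; the np.unique variant avoids the Python-level loop, which may matter for large arrays.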
Ondro
  • Oops, @user40 you are right, so it should be sorted first. Moreover, it would be handy to collect the result in a dictionary. Code updated. – Ondro Mar 02 '14 at 16:27

(Late to the party, just thought I would add a seaborn implementation)

Seaborn Implementation of the above question:

seaborn.__version__ = 0.9.0 at time of writing.

Load the libraries and setup mock data.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

data = np.array([3]*10 + [5]*20 + [7]*5 + [9]*27 + [11]*2)

Plot the data using seaborn.distplot:

Using specified bins, calculated as per the above question.

sns.distplot(data, bins=np.arange(data.min(), data.max()+2), kde=False, hist_kws={"align": "left"})
plt.show()

Trying numpy built-in binning methods

I used the doane binning method below, which produced integer bins. It might be worth trying the other standard binning methods from numpy.histogram_bin_edges, as this is how matplotlib.pyplot.hist() bins the data.

sns.distplot(data,bins="doane",kde=False,hist_kws={"align" : "left"})
plt.show()

Produces the histogram below:

[Image: histogram produced by the code above]
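The borders chosen by a named binning method can also be inspected directly, without plotting. This is a sketch using numpy.histogram_bin_edges on the same mock data; it assumes a NumPy version that provides that function (1.15 or later), and the exact borders may differ between versions.

```python
import numpy as np

# Same mock data as above.
data = np.array([3]*10 + [5]*20 + [7]*5 + [9]*27 + [11]*2)

# Ask NumPy which borders the "doane" estimator would choose.
edges = np.histogram_bin_edges(data, bins="doane")
print(edges)

# Reuse those borders to count the data.
counts, _ = np.histogram(data, bins=edges)
print(counts)
```

Swapping "doane" for another estimator name accepted by NumPy, such as "auto", "fd" or "sturges", shows how the borders change for the same data.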

RK1