Here's a way to do it with pymc. This method uses a fixed number of components (n_components) in the mixture model. You could try attaching a prior to n_components and sampling over that prior, or you could just pick something reasonable or use the grid-search technique from my other answer to estimate the number of components (there's a rough sketch of that idea at the end of this answer). In the code below I used 10000 iterations with a burn-in of 9000, but that might not be sufficient to get good results; I would suggest a much longer run and burn-in. I also chose the priors somewhat arbitrarily, so those are worth revisiting. You'll have to experiment with it. Best of luck to you. Here is the code.
import numpy as np
import pymc as mc
import scipy.stats as stats
from matplotlib import pyplot
def generate_random_histogram():
    # Random bin locations between 0 and 100
    bin_locations = np.random.rand(10,) * 100
    bin_locations.sort()
    # Random counts on those locations
    bin_counts = np.random.randint(50, size=len(bin_locations))
    return {'loc': bin_locations, 'count': bin_counts}
def bin_widths(loc):
    widths = []
    for i in range(len(loc) - 1):
        widths.append(loc[i + 1] - loc[i])
    # Give the last bin the same width as the one before it
    widths.append(widths[-1])
    widths = np.array(widths)
    return widths
def mixer(name, weights, value=None):
    # Custom categorical-style node: each data point picks one mixture component
    n = value.shape[0]

    def logp(value, weights):
        # One-hot encode each point's component choice and score it against the weights
        vals = np.zeros(shape=(n, weights.shape[1]), dtype=int)
        vals[np.arange(n), value.astype(int)] = 1
        return mc.multinomial_like(x=vals, n=1, p=weights)

    def random(weights):
        return np.argmax(np.random.multinomial(n=1, pvals=weights[0, :], size=n), 1)

    result = mc.Stochastic(logp=logp,
                           doc='A kernel smoothing density node.',
                           name=name,
                           parents={'weights': weights},
                           random=random,
                           trace=True,
                           value=value,
                           dtype=int,
                           observed=False,
                           cache_depth=2,
                           plot=False,
                           verbose=0)
    return result
def create_model(lowers, uppers, count, n_components):
    n = np.sum(count)
    lower = min(lowers)
    upper = max(uppers)
    locations = mc.Uniform(name='locations', lower=lower, upper=upper,
                           value=np.random.uniform(lower, upper, size=n_components), observed=False)
    precisions = mc.Gamma(name='precisions', alpha=1, beta=1,
                          value=.001 * np.ones(n_components), observed=False)
    weights = mc.CompletedDirichlet('weights', mc.Dirichlet(name='weights_ind', theta=np.ones(n_components)))
    choice = mixer('choice', weights, value=np.ones(n).astype(int))

    @mc.observed
    def histogram_data(value=count, locations=locations, precisions=precisions, weights=weights):
        if hasattr(weights, 'value'):
            weights = weights.value
        # Mixture CDF evaluated at the lower and upper bin edges
        lower_cdfs = sum([weights[0, i] * stats.norm.cdf(lowers, loc=locations[i], scale=np.sqrt(1.0 / precisions[i]))
                          for i in range(weights.shape[1])])
        upper_cdfs = sum([weights[0, i] * stats.norm.cdf(uppers, loc=locations[i], scale=np.sqrt(1.0 / precisions[i]))
                          for i in range(weights.shape[1])])
        bin_probs = upper_cdfs - lower_cdfs
        # Append a catch-all bin so the probabilities sum to one
        bin_probs = np.array(list(bin_probs) + [1.0 - np.sum(bin_probs)])
        n = np.sum(count)
        return mc.multinomial_like(x=np.array(list(count) + [0]), n=n, p=bin_probs)

    @mc.deterministic
    def location(locations=locations, choice=choice):
        return locations[choice.astype(int)]

    @mc.deterministic
    def dispersion(precisions=precisions, choice=choice):
        return precisions[choice.astype(int)]

    data_generator = mc.Normal('data', mu=location, tau=dispersion)
    return locals()
# Generate the histogram
hist = generate_random_histogram()
loc = hist['loc']
count = hist['count']
widths = bin_widths(hist['loc'])
lowers = loc - widths
uppers = loc + widths
# Create the model
model = create_model(lowers, uppers, count, 5)
# Initialize to the MAP estimate
model = mc.MAP(model)
model.fit(method ='fmin')
# Now sample with MCMC
model = mc.MCMC(model)
model.sample(iter=10000, burn=9000, thin=300)
# Plot the mu and tau traces
mc.Matplot.plot(model.trace('locations'))
pyplot.show()
# Get the samples from the fitted pdf
sample = np.ravel(model.trace('data')[:])
# Plot the original histogram, sampled histogram, and pdf
lower = min(lowers)
upper = max(uppers)
kde = stats.gaussian_kde(sample)
x = np.arange(0,100,.1)
y = kde(x)
fig = pyplot.figure()
ax1 = fig.add_subplot(311)
pyplot.xlim(lower,upper)
ax1.bar(loc, count, width=widths)
ax2 = fig.add_subplot(312, sharex=ax1)
ax2.hist(sample, bins=loc)
ax3 = fig.add_subplot(313, sharex=ax1)
ax3.plot(x, y)
pyplot.show()
And as you can see, the two distributions don't look terribly alike; then again, a histogram isn't much to go on. I would play with different numbers of components and more iterations / burn-in, but it's a project. Depending on your priorities, I suspect either @askewchan's answer or my other answer might serve you better. If you do stick with this approach, the sketches below show two of the tweaks I mentioned: picking the number of components and running the chain longer with a convergence check.
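If you want a rough way to pick n_components without putting a prior on it, one option (in the spirit of the grid search from my other answer) is to fit the MAP estimate for a handful of candidate values and compare an information criterion; pymc's MAP object exposes AIC and BIC after fitting. The candidate list below is arbitrary, and I haven't verified how well BIC behaves for this particular model, so treat this as a starting point rather than a recipe.
# Rough sketch: score a few candidate component counts by BIC at the MAP estimate.
# The candidate values are arbitrary -- adjust them to your data.
candidates = [2, 3, 5, 8]
scores = {}
for k in candidates:
    m = mc.MAP(create_model(lowers, uppers, count, k))
    m.fit(method='fmin')
    scores[k] = m.BIC
    print('%d components: BIC = %f' % (k, m.BIC))
best_k = min(scores, key=scores.get)
print('lowest BIC with %d components' % best_k)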

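And for the longer run: here is a sketch of the same sampling step with more iterations and a quick Geweke check on one of the location traces, assuming your pymc version exposes the geweke diagnostic and Matplot.geweke_plot. The iteration counts are still arbitrary, and keep in mind that label switching between components can make individual traces hard to read even when the chain is behaving.
# Rough sketch: a longer MCMC run plus a Geweke convergence check on one
# component location. Iteration counts are arbitrary; scale them to taste.
model = create_model(lowers, uppers, count, 5)
mc.MAP(model).fit(method='fmin')
model = mc.MCMC(model)
model.sample(iter=200000, burn=100000, thin=100)
loc_trace = model.trace('locations')[:][:, 0]    # trace of the first component's location
z_scores = mc.geweke(loc_trace)                  # Geweke z-scores along the chain
mc.Matplot.geweke_plot(z_scores, 'location_0')   # z-scores should stay within about +/- 2
pyplot.show()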