Why does this Kernel Density Estimation have values over 1.0?

Question

I'm trying to analyse the features of the Pima Indians Diabetes Data Set (follow the link to get the dataset) by plotting their probability density distributions. I haven't yet removed invalid 0 data, so the plots sometimes show a bias at the very left. For the most part, the distributions look accurate:

I have a problem with the look of the plot for DiabetesPedigree, which shows probabilities over 1.0 (for x ~ between 0.1 and 0.5). As I understand it, the combined probabilities should equal 1.0.

I've isolated the code for the DiabetesPedigree plot, but the same will work for the others by changing the dataset_index value:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

dataset_index = 6
feature_name = "DiabetesPedigree"
filename = 'pima-indians-diabetes.data.csv'

data = pd.read_csv(filename)
# .ix has been removed from pandas; .iloc selects the column by position
feature_data = data.iloc[:, dataset_index]

graph_min = feature_data.min()
graph_max = feature_data.max()

density = gaussian_kde(feature_data)
density.covariance_factor = lambda : .25
density._compute_covariance()

xs = np.arange(graph_min, graph_max, (graph_max - graph_min)/200)
ys = density(xs)

plt.xlim(graph_min, graph_max)
plt.title(feature_name)
plt.plot(xs,ys)

plt.show()

The *integral* over a pdf is 1. There is no contradiction to be seen here. You can quickly calculate a rough estimate: the part between 0 and 0.5 has an average value of about 1.5, and the part between 0.5 and 1 has an average value of about 0.5. The rest of the curve is negligible. Then 0.5*1.5 + 0.5*0.5 = 1, so everything seems correct. — ImportanceOfBeingErnest, Sep 27 '17 at 10:24
@ImportanceOfBeingErnest - My understanding is that the probability of a particular value (or small range) can be read off the graph by reading the corresponding y-value at that point. The highest possible probability is 1.0, which means the value is certain, in which case all other points should have a 0 value. A probability of 1.75 does not make sense to me. By your reasoning, all the other graphs have integrals way below 1.0. — maccaroo, Sep 27 '17 at 12:06
In that case you probably want to look into some statistics or math book or google for KDE and PDF to adjust your understanding of PDF/KDE. In all cases you show the integral is 1 as expected. — ImportanceOfBeingErnest, Sep 27 '17 at 12:25

score 0 · Answer 1 · answered Sep 27 '17 at 12:33

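As rightly pointed out in the comments, a continuous pdf is not required to stay below 1. For a continuous random variable, the function p(x) is a density, not a probability; a probability is obtained only by integrating p(x) over an interval, and it is that integral which cannot exceed 1. You can refer to any introduction to continuous random variables and their distributions.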
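To make this concrete, here is a minimal sketch that can be run without the dataset. It assumes a synthetic, gamma-distributed sample as a stand-in for a feature concentrated on a narrow range (the sample, the seed and the interval 0.1 to 0.5 are illustrative choices, not values from the question), and it uses the same bandwidth factor of 0.25 as the question's code.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Assumption: synthetic data standing in for a narrow-range feature;
# this is NOT the Pima dataset.
rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=0.2, size=768)

# A scalar bw_method is used directly as the bandwidth factor,
# matching the 0.25 factor set in the question.
density = gaussian_kde(sample, bw_method=0.25)

# Evaluate the density on a grid and find its highest value.
xs = np.linspace(sample.min(), sample.max(), 2000)
peak = density(xs).max()

# Area under the whole curve, and the probability of one interval.
total_area = density.integrate_box_1d(-np.inf, np.inf)
prob_interval = density.integrate_box_1d(0.1, 0.5)

print("peak density:", peak)
print("total area:", total_area)
print("P(0.1 <= x <= 0.5):", prob_interval)
```

The peak density comes out above 1 because the data are squeezed into an interval narrower than 1, while the total area stays at 1 and the probability of any interval stays between 0 and 1.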

user8662125
