I'm trying to analyse the features of the Pima Indians Diabetes Data Set (follow the link to get the dataset) by plotting their probability density distributions. I haven't yet removed invalid 0 data, so the plots sometimes show a bias at the very left. For the most part, the distributions look accurate:

All Probability Density Distributions

I have a problem with the plot for DiabetesPedigree, which shows values over 1.0 on the y-axis (for x roughly between 0.1 and 0.5). As I understand it, the combined probabilities should equal 1.0.

Probability Density Distribution for DiabetesPedigree

I've isolated the code for the DiabetesPedigree plot, but the same code works for the other features if you change the dataset_index value:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

dataset_index = 6
feature_name = "DiabetesPedigree"
filename = 'pima-indians-diabetes.data.csv'

data = pd.read_csv(filename)
feature_data = data.iloc[:, dataset_index]  # .ix was removed in pandas 1.0; .iloc selects by position

graph_min = feature_data.min()
graph_max = feature_data.max()

density = gaussian_kde(feature_data)
density.covariance_factor = lambda : .25
density._compute_covariance()

xs = np.arange(graph_min, graph_max, (graph_max - graph_min)/200)
ys = density(xs)

plt.xlim(graph_min, graph_max)
plt.title(feature_name)
plt.plot(xs,ys)

plt.show()
Trenton McKinney
maccaroo
  • The *integral* over a PDF is 1, so there is no contradiction here. You can quickly make a rough estimate: the part between 0 and 0.5 has an average value of about 1.5, the part between 0.5 and 1 has an average value of about 0.5, and the rest of the curve is negligible. Then 0.5*1.5 + 0.5*0.5 = 1, so everything seems correct. – ImportanceOfBeingErnest Sep 27 '17 at 10:24
  • @ImportanceOfBeingErnest - My understanding is that the probability of a particular value (or small range) can be read off the graph by reading the corresponding y-value at that point. The highest possible probability is 1.0, which means the value is certain, in which case all other points should have a 0 value. A probability of 1.75 does not make sense to me. By your reasoning, all the other graphs have integrals way below 1.0. – maccaroo Sep 27 '17 at 12:06
  • In that case you probably want to consult a statistics or math book, or search for KDE and PDF, to adjust your understanding of PDFs and KDEs. In all the cases you show, the integral is 1, as expected. – ImportanceOfBeingErnest Sep 27 '17 at 12:25
  • Why the downvote? It's a well-formed, valid question. – maccaroo Sep 11 '18 at 01:30

1 Answer

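As the comments rightly point out, nothing requires a continuous PDF to stay below 1. For a continuous random variable, the density function p(x) is not itself a probability; a probability is the area under p(x) over an interval, and only the total area has to equal 1. For background, you can refer to material on continuous random variables and their distributions.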
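To make this concrete, here is a minimal sketch that checks the area numerically. It does not use the Pima dataset: the `sample` array is synthetic data (a gamma distribution picked only because it is concentrated in a narrow range, much like DiabetesPedigree), and the names `peak`, `total_area` and `prob_interval` are illustrative. It uses `bw_method=0.25`, which should correspond to the bandwidth factor set by hand in the question.

```python
import numpy as np
from scipy.stats import gaussian_kde

# ASSUMPTION: synthetic stand-in data, not the real DiabetesPedigree column.
# A gamma distribution with a small scale puts most of its mass below 1.0.
rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=0.2, size=768)

# Scalar bw_method sets the bandwidth factor directly.
density = gaussian_kde(sample, bw_method=0.25)

# Evaluate the density on a grid and find its highest point.
xs = np.linspace(sample.min(), sample.max(), 2000)
peak = density(xs).max()

# Area under the whole curve (bounds far outside the data range).
total_area = density.integrate_box_1d(sample.min() - 5.0, sample.max() + 5.0)

# Probability of landing in an interval is the area over that interval.
prob_interval = density.integrate_box_1d(0.1, 0.5)

print(f"peak density value: {peak:.3f}")
print(f"total area:         {total_area:.6f}")
print(f"P(0.1 <= X <= 0.5): {prob_interval:.3f}")
```

The peak value can exceed 1 because the data are squeezed into an interval narrower than 1 unit, while the total area stays at 1 and the probability for any interval stays between 0 and 1. The same `integrate_box_1d` call could be applied to the `density` object in the question to read off actual probabilities.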