I'm trying to analyse the features of the Pima Indians Diabetes Data Set (follow the link to get the dataset) by plotting their probability density distributions. I haven't yet removed the invalid 0 values, so some plots show a spike at the far left. For the most part, the distributions look reasonable.
My problem is with the plot for DiabetesPedigree, which shows values above 1.0 on the y-axis (for x roughly between 0.1 and 0.5). As I understand it, the combined probabilities should equal 1.0, so I did not expect any point on the curve to exceed 1.0.
I've isolated the code for the DiabetesPedigree plot, but the same code works for the other features if you change the dataset_index value:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
dataset_index = 6
feature_name = "DiabetesPedigree"
filename = 'pima-indians-diabetes.data.csv'
data = pd.read_csv(filename)
# .ix was removed from pandas; .iloc selects the column by integer position
feature_data = data.iloc[:, dataset_index]
graph_min = feature_data.min()
graph_max = feature_data.max()
# A scalar bw_method sets the bandwidth factor directly, so there is no need
# to override covariance_factor and call the private _compute_covariance()
density = gaussian_kde(feature_data, bw_method=0.25)
# linspace gives exactly 200 points and includes graph_max
xs = np.linspace(graph_min, graph_max, 200)
ys = density(xs)
plt.xlim(graph_min, graph_max)
plt.title(feature_name)
plt.plot(xs,ys)
plt.show()
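One way to check the "should equal 1.0" expectation is to look at the area under the estimated curve rather than its height. The following is only a minimal sketch: it uses synthetic gamma-distributed values as a stand-in for the DiabetesPedigree column (an assumption, since the CSV is not bundled here), and the names sample, peak and area are illustrative. It reports the tallest point of the curve and the integral of the curve over a range wide enough to cover the data.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic stand-in for a narrow, right-skewed feature.
# This is NOT the real Pima data; the shape and scale are assumptions.
rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=0.15, size=768)

# Same bandwidth factor as in the question.
density = gaussian_kde(sample, bw_method=0.25)

# Height of the curve on a grid spanning the data.
xs = np.linspace(sample.min(), sample.max(), 200)
peak = density(xs).max()

# Area under the curve, with the limits padded so the kernel tails are included.
area = density.integrate_box_1d(sample.min() - 1.0, sample.max() + 1.0)

print("peak height:", peak)
print("area under curve:", area)
```

With the real data, replacing sample by feature_data should allow the same comparison between the height of the curve and the area beneath it.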