
My actual goal is to calculate the difference between two histograms. For this I would like to use the Kullback–Leibler divergence. In the thread Calculating KL Divergence in Python it was said that SciPy's entropy function will calculate the KL divergence. For this I need a probability distribution of my datasets. I tried to follow the answers and instructions given in these two threads: How do I calculate PDF (probability density function) in Python? and How to effectively compute the pdf of a given dataset. Unfortunately, I always get an error.
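As far as I understand the linked answers, scipy.stats.entropy should work like this on two discrete probability vectors (the numbers below are made-up values just to illustrate the call):

import numpy as np
from scipy.stats import entropy

#Two made-up discrete probability distributions (each sums to 1)
p = np.array([0.1, 0.4, 0.5])
q = np.array([0.2, 0.3, 0.5])

#With a second argument, entropy returns the Kullback–Leibler divergence D_KL(p || q)
print(entropy(p, q))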

Here you can see my code, in which I subdivide the data into 3 parts (training, validation and test dataset) and aim to calculate the pairwise difference between the data distributions of those 3 sets.

import scipy
from scipy.stats import norm
from scipy.stats import rv_histogram
import numpy as np
import pandas as pd



#Reading the data
df = pd.read_csv("C:/Users/User1/Desktop/TestData_Temperatures.csv", sep=';')

#Build training, validation and test data set      
timeslot_x_train_end = int(len(df)* 0.7)
timeslot_x_valid_end = int(len(df)* 0.9)

data_histogram = df[['temperatures']].values
data_train_histogram = data_histogram [:timeslot_x_train_end]
data_valid_histogram = data_histogram [timeslot_x_train_end:timeslot_x_valid_end]
data_test_histogram = data_histogram [timeslot_x_valid_end:]

#Make histogram out of numpy array
histogram_train = rv_histogram(np.histogram(data_train_histogram, bins='auto'))
histogram_valid = rv_histogram(np.histogram(data_valid_histogram, bins='auto'))
histogram_test = rv_histogram(np.histogram(data_test_histogram, bins='auto'))

#Make probability distribution out of the histogram
pdfs_train = norm.cdf(histogram_train, histogram_train.mean(), histogram_train.std())
pdfs_valid = norm.cdf(histogram_valid, histogram_valid.mean(), histogram_valid.std())
pdfs_test = norm.cdf(histogram_test, histogram_valid.mean(), histogram_valid.std())

#Calculate the entropy between the different datasets
entropy_train_valid = scipy.special.rel_entr(pdfs_train, pdfs_valid)   
entropy_train_test = scipy.special.rel_entr(pdfs_train, pdfs_test) 
entropy_valid_test = scipy.special.rel_entr(pdfs_valid, pdfs_test) 

#Calculate the Kullback–Leibler divergence between the different datasets
kl_div_train_valid = np.sum(entropy_train_valid)
kl_div_train_test = np.sum(entropy_train_test)
kl_div_valid_test = np.sum(entropy_valid_test)

#Print the values of the Kullback–Leibler divergence
print(f"Kullback–Leibler divergence between training and validation dataset: {kl_div_train_valid}")
print(f"Kullback–Leibler divergence between training and test dataset: {kl_div_train_test}")
print(f"Kullback–Leibler divergence between validation and test dataset: {kl_div_valid_test}")

In this setup I get the error "TypeError: unsupported operand type(s) for -: 'rv_histogram' and 'float'", thrown by the line pdfs_train = norm.cdf(histogram_train, histogram_train.mean(), histogram_train.std()). Here you can see the test dataset: TestData. Do you have an idea why I get this error and how I can calculate the probability distribution from the arrays (and eventually the Kullback–Leibler divergence)?

Reminder: Any comments? I'd appreciate every comment.

PeterBe

1 Answer


A histogram of a sample can approximate the pdf of the underlying distribution, provided the sample is large enough to capture the distribution of the population. So when you use histogram_train = rv_histogram(np.histogram(data_train_histogram, bins='auto')), it generates a distribution object backed by that histogram. It has a .pdf method to evaluate the density and an .rvs method to draw values that follow this distribution.
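For illustration, here is a minimal sketch of that interface on synthetic data (the sample below is made up just to show the methods):

import numpy as np
from scipy.stats import rv_histogram

#Synthetic sample, only to illustrate the rv_histogram interface
sample = np.random.normal(loc=12, scale=6.5, size=1000)
hist_dist = rv_histogram(np.histogram(sample, bins='auto'))

print(hist_dist.pdf(12.0))    #evaluate the density at a single point
print(hist_dist.rvs(size=5))  #draw 5 values that follow the histogram distribution

With that in mind, to calculate the Kullback–Leibler divergence between two of your datasets you can do the following: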

import numpy as np
import pandas as pd
import scipy.special
from scipy.stats import rv_histogram, norm

#Reading the data
df = pd.read_csv("C:/Users/User1/Desktop/TestData_Temperatures.csv", sep=';')

#Build training, validation   
timeslot_x_train_end = int(len(df)* 0.7)
timeslot_x_valid_end = int(len(df)* 0.9)

data = df[['temperatures']].values
data_train = data[:timeslot_x_train_end]
data_valid = data[timeslot_x_train_end:timeslot_x_valid_end]

#Make distribution objects of the histograms
histogram_dist_train = rv_histogram(np.histogram(data_train, bins='auto'))
histogram_dist_valid = rv_histogram(np.histogram(data_valid, bins='auto'))

#Evaluate both pdfs on one common grid: the Kullback–Leibler divergence
#compares the two distributions at the same points
X = np.linspace(min(np.min(data_train), np.min(data_valid)),
                max(np.max(data_train), np.max(data_valid)), 1000)
pdf_train = histogram_dist_train.pdf(X)
pdf_valid = histogram_dist_valid.pdf(X)

#Normalize the evaluations so they form discrete probability distributions
pdf_train = pdf_train / np.sum(pdf_train)
pdf_valid = pdf_valid / np.sum(pdf_valid)

#Calculate the Kullback–Leibler divergence between the two datasets
#(rel_entr returns inf at points where pdf_train > 0 but pdf_valid == 0)
entropy_train_valid = scipy.special.rel_entr(pdf_train, pdf_valid)
kl_div_train_valid = np.sum(entropy_train_valid)

#Print the values of the Kullback–Leibler divergence
print(f"Kullback–Leibler divergence between training and validation dataset: {kl_div_train_valid}")

On the other hand, if you assume that the data follow a normal distribution, you can do the following:

#Build training, validation   
timeslot_x_train_end = int(len(df)* 0.7)
timeslot_x_valid_end = int(len(df)* 0.9)

data = df[['temperatures']].values
data_train = data[:timeslot_x_train_end]
data_valid = data[timeslot_x_train_end:timeslot_x_valid_end]

#Make normal distribution objects from data mean and standard deviation
norm_dist_train = norm(data_train.mean(), data_train.std())
norm_dist_valid = norm(data_valid.mean(), data_valid.std())

#Evaluate both pdfs on one common grid
X = np.linspace(min(np.min(data_train), np.min(data_valid)),
                max(np.max(data_train), np.max(data_valid)), 1000)
pdf_train = norm_dist_train.pdf(X)
pdf_valid = norm_dist_valid.pdf(X)

#Normalize the evaluations so they form discrete probability distributions
pdf_train = pdf_train / np.sum(pdf_train)
pdf_valid = pdf_valid / np.sum(pdf_valid)

#Calculate the Kullback–Leibler divergence between the two datasets
entropy_train_valid = scipy.special.rel_entr(pdf_train, pdf_valid)
kl_div_train_valid = np.sum(entropy_train_valid)

#Print the values of the Kullback–Leibler divergence
print(f"Kullback–Leibler divergence between training and validation dataset: {kl_div_train_valid}")
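As a quick sanity check for either variant, comparing a distribution with itself should give a divergence of zero:

#Sanity check: the Kullback–Leibler divergence of a distribution with itself is 0
entropy_train_train = scipy.special.rel_entr(pdf_train, pdf_train)
print(np.sum(entropy_train_train))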
jomavera
  • Thanks a lot for your answer. Actually, I don't understand why you don't make a probability distribution out of the data using `pdfs_train = norm.cdf(histogram_train, histogram_train.mean(), histogram_train.std())` as recommended here https://stackoverflow.com/questions/41974615/how-do-i-calculate-pdf-probability-density-function-in-python. Instead you sample random variables. I'm not sure if this is correct. When using your suggested approach, the values for the KL_Divergence are highly fluctuating and non-deterministic even with a high sample size. It is almost like random guessing. – PeterBe Feb 28 '22 at 15:07
  • When using your second suggested approach with the normal distribution, I always get `inf` as the result. – PeterBe Feb 28 '22 at 15:09
  • Doing `norm.cdf(histogram_train, histogram_train.mean(), histogram_train.std())` doesn't make sense. Reading the documentation of `norm.cdf(x, loc, scale)`, this evaluates the cumulative distribution function of a normal distribution with mean `loc` and std `scale` at `x`. `scipy.special.rel_entr` is an element-wise function, so you must pass arrays of values as arguments so that it calculates the relative entropy between two arrays/datasets. The results are highly fluctuating maybe because the data have high variance and the sample (100) is small. – jomavera Feb 28 '22 at 21:41
  • Thanks for your comment and effort, jomavera. I really appreciate it. Actually, I don't think the data has such a high variance. The variance is 42.7, the standard deviation is 6.5 and the mean is 12. So the coefficient of variation is 0.55, which is not that high as far as I see it. – PeterBe Mar 01 '22 at 10:40
  • Anyway, I strongly increased the sample size to 1000000, but the results are still strongly fluctuating and it seems to me that they are more or less random. This is quite strange to me and I would argue that something is wrong with the approach. Either your suggested approach has some errors or the way I use it is not correct. I have uploaded the data. Maybe you can try your own code with the data and see if the values of the KL_divergence are also kind of random guessing? – PeterBe Mar 01 '22 at 10:40
  • There is always a component of randomness since you are comparing two generated random samples, so you won't get the same result every time (unless you use the same random seed). By the law of large numbers, increasing the sample size reduces the difference between results. – jomavera Mar 01 '22 at 13:17
  • Thanks for your comment. I don't think that the component of randomness can explain the results I get when using your suggested approach for the KL_divergence. I have now increased the sample size to 900000000000 (900 billion) and have executed the code 3 times. The values I get for the same dataset are: -18.31, 4.6, -2.38. This can't be possible. I am quite sure that there is an error in your code/approach. – PeterBe Mar 01 '22 at 13:50
  • You are right. I'm so sorry for my answer. The scipy function doesn't operate on random samples but on a sample of pdf evaluations. Now I'll edit my answer – jomavera Mar 01 '22 at 14:04
  • Thanks a lot, jomavera, for your answer and effort. Now it looks good and the values are not fluctuating anymore. I upvoted and accepted your answer. – PeterBe Mar 01 '22 at 15:25
  • Maybe one last follow-up question: how can I decide which of your 2 approaches (histogram approach, normal-distribution approach) I should choose, and why do I have to distinguish between a normal distribution and a histogram? Actually, a normal distribution is something different from a histogram. So even if the data is normally distributed, the histogram of this distribution should be similar to the density function (for large sample sizes). – PeterBe Mar 01 '22 at 15:25
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/242527/discussion-between-jomavera-and-peterbe). – jomavera Mar 01 '22 at 19:33