
I have an experiment that involves sensors, and I have around 5 data files containing time-domain data collected from them. For simplicity, let's concentrate on one sensor; I need to obtain the probability distribution for each data file. I looked online and managed to find the best-fit distribution using the following link:

Fitting empirical distribution to theoretical ones with Scipy (Python)

For my case, it turns out that a normal distribution fits my data. So I now have multiple distributions and would like to combine them all into one distribution. What I did was average the probability densities: I took the density values from each distribution, summed them pointwise, and divided by 5.

The averaging is done with the following code:

def average(l):
    # element-wise average of several equal-length lists
    llen = len(l)
    def divide(x):
        return x / llen
    return map(divide, map(sum, zip(*l)))

lt = []
for _ in range(5):
    # read sensor data
    # obtain the probability distribution using the code in the first link
    # collect the PDF values as a list:
    np_pdf = list(y_axis_pdf)
    lt.append(np_pdf)

Average_list = list(average(lt))

However, I asked a couple of people and searched online, and I was told that averaging is not the best way to do this. So, what is the correct way to combine several probability distributions into one?

My second question: I searched online and found this article:

How to Combine Independent Data Sets for the Same Quantity

How can I use the code from the first link with the method described in the article? My rough attempt at applying it is sketched below.
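
If I understand the article correctly, the combined density is the pointwise product of the individual densities, renormalised so that it integrates to 1 (the equation at the bottom of page 5). This is my sketch of how that would apply to the fitted PDFs; the five (mean, std) pairs are placeholders standing in for my actual fits:

import numpy as np
import scipy.stats as st

domain = np.arange(-10, 10, 0.01)
# placeholder (mean, std) pairs standing in for the normals fitted with the first link
fits = [(2, 1), (2.5, 1.5), (2.2, 1.6), (2.4, 1.3), (2.7, 1.5)]
pdfs = [st.norm.pdf(domain, m, s) for m, s in fits]

product = np.prod(pdfs, axis=0)                  # multiply the densities pointwise
conflated = product / np.trapz(product, domain)  # renormalise so the area under the curve is 1

Is that the right way to plug the fitted PDFs into the article's equation?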

Edit 1:

Based on a comment from @SeverinPappadeux, I edited my code as follows:

# imports assumed for this snippet:
# import numpy as np; import pandas as pd
# from sklearn.mixture import GaussianMixture as GMM  (my alias for the mixture class)

# Combining all PDF files into one dataset:
pdf_data = [np_pdf_01, np_pdf_02, np_pdf_03, np_pdf_04, np_pdf_05]
pdf_dataframe_ini = pd.DataFrame(pdf_data)
pdf_dataframe = pd.DataFrame.transpose(pdf_dataframe_ini)

# Creating one PDF from the PDF dataset:
gmm = GMM(n_components=1)
gmm.fit(pdf_dataframe)
x_pdf_data = [x_axis_pdf_01, x_axis_pdf_02, x_axis_pdf_03, x_axis_pdf_04, x_axis_pdf_05]
x_pdf = average(x_pdf_data)
x_pdf = list(x_pdf)
x = np.linspace(np.min(x_pdf), np.max(x_pdf), len(x_pdf)).reshape(len(x_pdf), 1)
logprob = gmm.score_samples(x)
pdf = np.exp(logprob)

I keep getting the following error:

logprob = gmm.score_samples(x)
ValueError: Expected the input data X have 10 features, but got 1 features

How can I solve this error and get the PDF plot for the combined PDFs?
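
For context, my understanding of the suggestion is that the mixture should be fitted on the raw pooled samples (one feature per sample) rather than on the PDF values themselves. The sketch below is what I think that would look like; sensor_data_01 ... sensor_data_05 are placeholders for my raw time-domain arrays:

import numpy as np
from sklearn.mixture import GaussianMixture

# pool the raw samples from the five runs into one column vector of shape (n_samples, 1)
samples = np.concatenate([sensor_data_01, sensor_data_02, sensor_data_03,
                          sensor_data_04, sensor_data_05]).reshape(-1, 1)

gmm = GaussianMixture(n_components=1)
gmm.fit(samples)

x = np.linspace(samples.min(), samples.max(), 1000).reshape(-1, 1)
pdf = np.exp(gmm.score_samples(x))  # score_samples returns the log-density

Is that the shape the data should have so that score_samples works on a single feature?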

Sources:

How can I plot the probability density function for a fitted Gaussian mixture model under scikit-learn?

Edit 2:

I tried to implement a multivariate normal in order to combine several distributions together; however, I got the following error message:

ValueError: shapes (5,2000) and (1,1) not aligned: 2000 (dim 1) != 1 (dim 0)

How can I solve this error? My code is below:

import numpy as np
import scipy.stats as st
from scipy.stats import multivariate_normal
import matplotlib.pyplot as plt

def normalizer(list_values):
    # scale the values so they sum to 1
    norm = [float(i) / sum(list_values) for i in list_values]
    return norm

lb=-10
ub=10
domain=np.arange(lb,ub,.01)
domain_size=domain.shape[0]
print(domain_size)

dist_1 = st.norm.pdf(domain, 2, 1)
dist_2 = st.norm.pdf(domain, 2.5, 1.5)
dist_3 = st.norm.pdf(domain, 2.2, 1.6)
dist_4 = st.norm.pdf(domain, 2.4, 1.3)
dist_5 = st.norm.pdf(domain, 2.7, 1.5)

# dist_1_norm = normalizer(dist_1)
# dist_2_norm = normalizer(dist_2)
# dist_3_norm = normalizer(dist_3)
# dist_4_norm = normalizer(dist_4)
# dist_5_norm = normalizer(dist_5)
dists=[dist_1, dist_2, dist_3, dist_4, dist_5]

plt.xlabel("domain")
plt.ylabel("pdf")
plt.title("Conflated PDF")
plt.legend()
plt.plot(domain, st.norm.pdf(domain, 2,1), 'r', label='Dist. 1')
plt.plot(domain, st.norm.pdf(domain, 2.5,1.5), 'g', label='Dist. 2')
plt.plot(domain, st.norm.pdf(domain, 2.2,1.6), 'b', label='Dist. 3')
plt.plot(domain, st.norm.pdf(domain, 2.4,1.3), 'y', label='Dist. 4')
plt.plot(domain, st.norm.pdf(domain, 2.7,1.5), 'c', label='Dist. 5')

dists=[dist_1, dist_2, dist_3, dist_4, dist_5]
graph=multivariate_normal.pdf(dists)

plt.plot(domain,graph, 'm', label='Combined Dist.')
plt.legend()
plt.show()
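
For reference, my understanding of how scipy's multivariate_normal.pdf is normally called is sketched below (points in d-dimensional space plus a mean vector and a covariance matrix), which is not what I am passing above; all the values here are placeholders:

import numpy as np
from scipy.stats import multivariate_normal

points = np.random.rand(2000, 5)                         # 2000 points in 5 dimensions (placeholder)
mean = np.array([2.0, 2.5, 2.2, 2.4, 2.7])               # placeholder mean vector
cov = np.diag(np.array([1.0, 1.5, 1.6, 1.3, 1.5]) ** 2)  # placeholder diagonal covariance
density = multivariate_normal.pdf(points, mean=mean, cov=cov)

So I suspect the error comes from passing the list of already-evaluated 1-D PDFs instead of points and distribution parameters; is that the right diagnosis?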
    Why not aggregate all the samples into one dataset and fit that? – Mad Physicist Sep 17 '20 at 17:04
  • @MadPhysicist you mean after I obtain the pdf, I create a dataset containing all 5 PDF and fit? Also, if I create the dataset, how can I fit the dataset? – WDpad159 Sep 17 '20 at 17:08
  • No I mean take all your data as a whole and make one PDF – Mad Physicist Sep 17 '20 at 17:08
  • @MadPhysicist How can I do that? – WDpad159 Sep 17 '20 at 17:13
  • Spend some time thinking about it. You haven't provided enough information for me to be able to answer that. – Mad Physicist Sep 17 '20 at 17:14
  • @MadPhysicist Not sure this is the right idea - combining them together. If there are 5 physically different sensors, each slightly different from another one, e.g. giving systematically different means, stddevs etc., a mixture distribution might be the answer – Severin Pappadeux Sep 17 '20 at 18:12
  • @WDpad159 you might want to look at https://jakevdp.github.io/PythonDataScienceHandbook/05.12-gaussian-mixtures.html – Severin Pappadeux Sep 17 '20 at 18:13
  • @SeverinPappadeux Well, my case is that I am measuring data from multiple sensors and I repeated the experiment 5 times (= 5 data files). So, just for simplicity, I want to concentrate on one sensor. I'll have a look at GMM. Is there any other way than GMM, and will it be possible to obtain a PDF plot from the GMM? Also, is it possible to use the multivariate normal pdf? – WDpad159 Sep 17 '20 at 18:35
  • @SeverinPappadeux. It was my understanding that OP is referring to multiple runs of the same sensor. Of course, if there are multiple sensors, you should create a model for each one. – Mad Physicist Sep 17 '20 at 18:44
  • @MadPhysicist I was checking the data individually and found out that the probability distribution peaks are just 0.1 apart. So, when implementing the GMM, what would be the possible outcome? – WDpad159 Sep 17 '20 at 21:40
  • @MadPhysicist I want to ask you a simple question, I know how to multiply elements from two lists but how can I do the multiplication and summation for more than 2 lists? Because I want to implement the conflation equation in (How to Combine Independent Data Sets for the Same Quantity) link [equation can be found at the bottom of page 5] – WDpad159 Sep 18 '20 at 17:06
  • @WDpad159 You could try multivariate normal, why not? But it is a lot more tightly coupled model, you have to get not only marginals right, but conditionals as well (e.g. distribution for x1 conditioned on values x2, x3, x4, ...) – Severin Pappadeux Sep 19 '20 at 19:54
  • @SeverinPappadeux I tried to implement multivariate normal but I keep getting the following error `ValueError: shapes (5,2000) and (1,1) not aligned: 2000 (dim 1) != 1 (dim 0)`. I passed the list of pdf_data as input to the multivariate normal pdf, but I am not sure I am doing it right. See the edited question for my approach – WDpad159 Apr 07 '21 at 12:21
