Generation of distribution in Python using scientific libraries
- Couldn't you just use the MinMaxScaler? – Kordi Mar 02 '16 at 08:41
3 Answers
I couldn't try your code at the moment, but you could limit the output range of the scaler like this:
min_max_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
data_scaled = min_max_scaler.fit_transform(data)  # data must be 2-D: (n_samples, n_features)
The upper bound 1 here is probably not the value you need; it only illustrates the concept. Link to the documentation: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler
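A minimal, runnable sketch of the same idea, using made-up values (the data here is hypothetical, not from the question):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical 1-D series; reshape to (n_samples, 1) as the scaler expects.
data = np.array([3.0, 7.0, 11.0, 15.0]).reshape(-1, 1)

scaler = MinMaxScaler(feature_range=(0, 1))
data_scaled = scaler.fit_transform(data)
print(data_scaled.ravel())  # smallest value maps to 0, largest to 1
```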

- Did you look at http://stackoverflow.com/questions/33162871/python-scikit-learn-svm-classifier-valueerror-found-array-with-dim-3-expected ? – Kordi Mar 02 '16 at 09:33
Let's start with
import pandas as pd
from scipy.stats import norm
from sklearn import preprocessing
from sklearn import mixture
import numpy as np
df = pd.read_csv('test2.csv')
Cleaning up:
df.dropna(inplace=True)
Following that, you want to apply a log to all the data. It usually pays to impute the data a bit for 0 (or near-0) values first. The factor alpha determines the imputation strength: alpha = 0 means no imputation at all.
alpha = 0.01
m = df.to_numpy()  # df.as_matrix() in older pandas; removed in pandas 1.0
m = alpha * np.ones_like(m) + (1 - alpha) * m
m = np.log(m)
Scaling:
m = preprocessing.scale(m)
Now, as the data is large, I had to sample it a bit for the following. Here's a sample of 1000 rows:
m = m[np.random.choice(range(m.shape[0]), 1000), :]
The mean vector and covariance matrix can be found with
mu, sigma = np.mean(m, axis=0), np.cov(m, rowvar=False)
These two parameters determine the (multivariate normal) distribution completely. From here on, you can do lots of things, e.g., generate further values from the fitted distribution.
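A short sketch of that last step, with synthetic data standing in for the scaled matrix `m`: once `mu` and `sigma` are estimated, new points can be drawn from the fitted multivariate normal.

```python
import numpy as np

rng = np.random.default_rng(0)
m = rng.normal(size=(1000, 3))   # stand-in for the scaled data matrix

mu = np.mean(m, axis=0)          # per-feature mean vector
sigma = np.cov(m, rowvar=False)  # (3, 3) covariance matrix

# Draw 500 new samples from the fitted Gaussian.
samples = rng.multivariate_normal(mu, sigma, size=500)
print(samples.shape)  # (500, 3)
```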

- @usert4jju7 Let's do it step by step. Which step is giving you the error you mentioned? – Ami Tavory Mar 02 '16 at 09:59
- @usert4jju7 I've been a bit busy - will be able to look at it later. – Ami Tavory Mar 05 '16 at 12:37
- @usert4jju7 So the thing is that the matrix contains negative values, which is why you can't take the log and scale (the log is undefined there). Are you sure you want to take a log? Why are you doing that? – Ami Tavory Mar 05 '16 at 18:41
I don't know a solution for your coding problem, but maybe you can consider another package. OpenTURNS is a Python package with many handy tools for statistics. You could use its Student distribution, which also comes in a multivariate version.
You also wrote that you get 'a' t-distribution but not the one you need; in that case you could check out the non-central Student distribution, and you may need to use copulas in order to create correlated marginals.
import openturns as ot
nu = 2                       # degrees of freedom
mu = [0.8, 0.2]              # location vector
sigma = [1.2, 1.0]           # scale vector
R = ot.CorrelationMatrix(2)  # identity by default
# fill R as needed
print(R)
dist = ot.Student(nu, mu, sigma, R)
# draw the joint PDF (works for at most 2 dimensions)
dist.drawPDF()
