
Generation of distribution in python using scientific libraries

usert4jju7

3 Answers


I couldn't try your code at the moment, but you could constrain the scaler's output range like this:

min_max_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
data_scaled = min_max_scaler.fit_transform([data])

The (0, 1) range is just a placeholder here; it's only meant to show the concept. Link to the documentation: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler
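As a minimal sketch with made-up numbers: MinMaxScaler expects a 2-D array, which is usually the source of the dimension errors mentioned in the comments, so a single column has to be reshaped first.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative 1-D data; reshape a single feature to a column vector,
# since MinMaxScaler expects a 2-D array of shape (n_samples, n_features).
data = np.array([3.0, 7.0, 11.0])
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(data.reshape(-1, 1))
print(scaled.ravel())  # [0.  0.5 1. ]
```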

Kordi
  • Did you look at http://stackoverflow.com/questions/33162871/python-scikit-learn-svm-classifier-valueerror-found-array-with-dim-3-expected ? – Kordi Mar 02 '16 at 09:33

Let's start with

import pandas as pd
from scipy.stats import norm
from sklearn import preprocessing
from sklearn import mixture
import numpy as np

df = pd.read_csv('test2.csv')

Cleaning up:

df.dropna(inplace=True)

Following that, you want to apply a log transform to all the data. It usually pays to impute values at (or close to) 0 first, since log(0) is undefined. The factor alpha determines the strength of the imputation; 0 means no imputation at all.

alpha = 0.01
m = df.to_numpy()  # df.as_matrix() was removed in recent pandas versions
m = alpha * np.ones_like(m) + (1 - alpha) * m  # convex combination with 1 pulls values away from 0
m = np.log(m)
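To see why the imputation matters, here is the same shift applied to a toy array that contains an exact zero (a minimal sketch with made-up numbers):

```python
import numpy as np

alpha = 0.01
m = np.array([0.0, 0.5, 1.0])  # includes an exact zero
m_imp = alpha * np.ones_like(m) + (1 - alpha) * m
print(m_imp)                   # [0.01  0.505 1.   ]
logs = np.log(m_imp)
print(np.isfinite(logs).all())  # True -- no -inf from log(0)
```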

Scaling:

m = preprocessing.scale(m)

Now, since the data is large, I had to subsample it for the following step. Here's a sample of 1000 rows, drawn without replacement:

m = m[np.random.choice(m.shape[0], 1000, replace=False), :]

The mean and covariance can be found with

# axis=0 / rowvar=False: rows are observations, columns are variables
mu, sigma = np.mean(m, axis=0), np.cov(m, rowvar=False)

These two parameters determine the distribution completely. From here on you can do lots of stuff, e.g., generate further values from the fitted distribution.
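For example, to draw new rows from the fitted multivariate normal (a sketch using random stand-in data in place of the scaled matrix, with mu and sigma computed column-wise as above):

```python
import numpy as np

rng = np.random.default_rng(0)
m = rng.normal(size=(1000, 3))  # stand-in for the scaled data matrix

# Column-wise mean vector and covariance matrix, as above.
mu, sigma = np.mean(m, axis=0), np.cov(m, rowvar=False)

# Draw 5 new observations from the fitted distribution.
new_rows = rng.multivariate_normal(mu, sigma, size=5)
print(new_rows.shape)  # (5, 3)
```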

Ami Tavory

I don't know a solution for your coding problem, but maybe you can consider using another package. OpenTURNS is a Python package with many handy tools for statistics. You could use its Student distribution, which also comes in a multivariate version.

You also wrote that you get 'a' t-distribution, but not the one you need; in that case you could also check out the non-central Student distribution. You may then need to use copulas in order to create correlated marginals.

import openturns as ot

nu = 2                       # degrees of freedom
mu = [0.8, 0.2]              # location vector
sigma = [1.2, 1.0]           # scale vector
R = ot.CorrelationMatrix(2)  # identity by default; fill in off-diagonal terms as needed
print(R)
dist = ot.Student(nu, mu, sigma, R)
dist.drawPDF()               # draws the PDF (for at most 2 dimensions)
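If installing OpenTURNS is not an option, SciPy (1.6+) also ships a multivariate Student t. A minimal sketch with illustrative location, shape, and degrees-of-freedom values:

```python
import numpy as np
from scipy.stats import multivariate_t

# Illustrative parameters: location vector, shape (scale) matrix, degrees of freedom.
loc = [0.8, 0.2]
shape = [[1.44, 0.3], [0.3, 1.0]]
df = 2

dist = multivariate_t(loc=loc, shape=shape, df=df)
samples = dist.rvs(size=1000, random_state=0)
print(samples.shape)  # (1000, 2)
```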
Henning