Generation of distribution in Python using scientific libraries
- Couldn't you just use the MinMaxScaler? – Kordi Mar 02 '16 at 08:41
3 Answers
I couldn't try your code at the moment, but you could limit the output range of the scaler like this:
min_max_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
data_scaled = min_max_scaler.fit_transform(data)  # data must be 2-D: (n_samples, n_features)
The upper bound 1 here is probably not the value you need; it only illustrates the concept. Link to the documentation: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler
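A minimal, runnable sketch of the same idea, using made-up values (the data here is hypothetical, not from the question):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical 1-D series; reshape to (n_samples, 1) as the scaler expects.
data = np.array([3.0, 7.0, 11.0, 15.0]).reshape(-1, 1)

scaler = MinMaxScaler(feature_range=(0, 1))
data_scaled = scaler.fit_transform(data)
print(data_scaled.ravel())  # smallest value maps to 0, largest to 1
```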

- Did you look at http://stackoverflow.com/questions/33162871/python-scikit-learn-svm-classifier-valueerror-found-array-with-dim-3-expected ? – Kordi Mar 02 '16 at 09:33
Let's start with
import pandas as pd
from scipy.stats import norm
from sklearn import preprocessing
from sklearn import mixture
import numpy as np
df = pd.read_csv('test2.csv')
Cleaning up:
df.dropna(inplace=True)
Following that, you want to apply a log to all the data. It usually pays to impute the data a bit for 0 (or near-0) values first. The factor alpha determines the imputation strength: alpha = 0 means no imputation at all.
alpha = 0.01
m = df.to_numpy()  # df.as_matrix() in older pandas; removed in pandas 1.0
m = alpha * np.ones_like(m) + (1 - alpha) * m
m = np.log(m)
Scaling:
m = preprocessing.scale(m)
Now, as the data is large, I had to sample it a bit for the following. Here's a sample of 1000 rows:
m = m[np.random.choice(range(m.shape[0]), 1000), :]
The mean vector and covariance matrix can be found with
mu, sigma = np.mean(m, axis=0), np.cov(m, rowvar=False)
These two parameters determine the (multivariate normal) distribution completely. From here on, you can do lots of things, e.g., generate further values from the fitted distribution.
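A short sketch of that last step, with synthetic data standing in for the scaled matrix `m`: once `mu` and `sigma` are estimated, new points can be drawn from the fitted multivariate normal.

```python
import numpy as np

rng = np.random.default_rng(0)
m = rng.normal(size=(1000, 3))   # stand-in for the scaled data matrix

mu = np.mean(m, axis=0)          # per-feature mean vector
sigma = np.cov(m, rowvar=False)  # (3, 3) covariance matrix

# Draw 500 new samples from the fitted Gaussian.
samples = rng.multivariate_normal(mu, sigma, size=500)
print(samples.shape)  # (500, 3)
```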

- @usert4jju7 Let's do it step by step. Which step is giving you the error you mentioned? – Ami Tavory Mar 02 '16 at 09:59
- @usert4jju7 I've been a bit busy - will be able to look at it later. – Ami Tavory Mar 05 '16 at 12:37
- @usert4jju7 So the thing is that the matrix contains negative values, which is why you can't take the log and scale (the log is undefined there). Are you sure you want to take a log? Why are you doing that? – Ami Tavory Mar 05 '16 at 18:41
I don't know a solution for your coding problem, but maybe you can consider another package. OpenTURNS is a Python package with many handy tools for statistics. You could use its Student distribution, which also comes in a multivariate version.
You also wrote that you get 'a' t-distribution but not the one you need; in that case you could check out the non-central Student distribution, and you may need to use copulas in order to create correlated marginals.
import openturns as ot
nu = 2                       # degrees of freedom
mu = [0.8, 0.2]              # location vector
sigma = [1.2, 1.0]           # scale vector
R = ot.CorrelationMatrix(2)  # identity by default
# fill R as needed
print(R)
dist = ot.Student(nu, mu, sigma, R)
# draw the joint PDF (works for at most 2 dimensions)
dist.drawPDF()
