
Is there a way in Python to generate random data based on the distribution of already existing data?

Here are the statistical parameters of my dataset:

Data
count   209.000000
mean    1.280144
std     0.374602
min     0.880000
25%     1.060000
50%     1.150000
75%     1.400000
max     4.140000

Since it is not a normal distribution, this is not possible with np.random.normal. Any ideas?

[Image: Distribution]

Thank you.

Edit: Performing KDE:

import seaborn as sns
from sklearn.neighbors import KernelDensity

# Gaussian KDE fitted to the data column
kde = KernelDensity(kernel='gaussian', bandwidth=0.525566).fit(data['y'].to_numpy().reshape(-1, 1))
# Plot a histogram of 2400 samples drawn from the fitted KDE
sns.distplot(kde.sample(2400))

[Image: KDE]

qwertz
    Take a look at https://stats.stackexchange.com/questions/215938/generate-synthetic-data-to-match-sample-data also https://towardsdatascience.com/synthetic-data-generation-a-must-have-skill-for-new-data-scientists-915896c0c1ae – Vishnudev Krishnadas Mar 18 '20 at 11:30
    You have 2 options. (1) identify the distribution (chi-square?) and generate it. (2) Do a box cox, generate normal, and then do [reverse](https://stackoverflow.com/questions/26391454/reverse-box-cox-transformation) – Sergey Bushmanov Mar 18 '20 at 11:32
  • For option (1): it could also be Weibull. How can I be sure about it? I will try option (2) first. Thank you. – qwertz Mar 18 '20 at 11:44

1 Answer


In general, real-world data doesn't exactly follow a "nice" distribution like the normal or Weibull distributions.

As in machine learning, sampling from a distribution of data points generally involves two steps:

  • Fit a data model to the data.

  • Then, predict a new data point based on that model, with the help of randomness.
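The two steps above can be sketched with SciPy's `rv_histogram`, which builds a sampleable distribution from a histogram. The `data` array here is a hypothetical stand-in for the asker's 209 skewed values, not the real dataset:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical stand-in for the real data column (209 skewed values near 1)
data = rng.gamma(shape=4.0, scale=0.1, size=209) + 0.88

# Step 1: fit a data model (here, a histogram-based distribution)
dist = stats.rv_histogram(np.histogram(data, bins=20))

# Step 2: draw new random points from the fitted model
synthetic = dist.rvs(size=2400, random_state=0)
print(synthetic.shape)  # (2400,)
```

Note that histogram sampling can only produce values inside the range of the observed data.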

There are several ways to estimate the distribution of data and sample from that estimate:

  • Kernel density estimation.
  • Gaussian mixture models.
  • Histograms.
  • Regression models.
  • Other machine learning models.
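As one illustration of the list above, a Gaussian mixture model can be fitted and sampled with scikit-learn's `GaussianMixture`; the `data` array is again a hypothetical stand-in, and `n_components=3` is an arbitrary choice:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical stand-in for the real data column
data = rng.gamma(shape=4.0, scale=0.1, size=209) + 0.88

# Fit a mixture of three Gaussians to the 1-D data
gmm = GaussianMixture(n_components=3, random_state=0).fit(data.reshape(-1, 1))

# Draw new synthetic samples from the fitted mixture
samples, _ = gmm.sample(2400)
print(samples.shape)  # (2400, 1)
```

Unlike a histogram, a mixture model can extrapolate slightly beyond the observed range.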

In addition, methods such as maximum likelihood estimation make it possible to fit a known distribution (such as the normal distribution) to data, but the estimated distribution is generally rougher than with kernel density estimation or other machine learning models.
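A minimal sketch of that maximum-likelihood approach, using SciPy to fit a Weibull distribution (one of the candidates mentioned in the comments) to a hypothetical stand-in for the data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical stand-in for the real data column
data = rng.gamma(shape=4.0, scale=0.1, size=209) + 0.88

# Maximum likelihood fit of a Weibull distribution (shape, location, scale)
c, loc, scale = stats.weibull_min.fit(data)

# Sample from the fitted parametric distribution
synthetic = stats.weibull_min.rvs(c, loc=loc, scale=scale, size=2400, random_state=0)
print(synthetic.shape)  # (2400,)
```

Goodness-of-fit checks (e.g., a Q-Q plot or `stats.kstest`) help decide whether the chosen family is adequate.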

See also my section "Random Numbers from a Distribution of Data Points".

Peter O.
  • But how shall I perform a regression model on a single parameter? – qwertz Mar 18 '20 at 12:37
  • Regression models apply to data that are inputs and outputs (e.g., sales figures for a particular month), which is not the kind of data you've shown here. For your problem, ignore the advice on regression models. Perhaps the most promising solution for your data is [kernel density estimation](http://scikit-learn.org/stable/modules/density.html), which scikit-learn supports. – Peter O. Mar 18 '20 at 12:45
  • But KDE also needs two parameters, right? My fit needs a second parameter and I have only one. – qwertz Mar 18 '20 at 13:09
  • What do you mean by "parameters"? – Peter O. Mar 18 '20 at 13:14
  • I see what you mean now: Kernel density estimation requires a bandwidth parameter, which is roughly the standard deviation of the data points. See also [this blog post](https://web.archive.org/web/20160501200206/http://mark-kay.net/2013/12/24/kernel-density-estimation) for a way to determine this bandwidth parameter using scikit-learn. – Peter O. Mar 18 '20 at 13:18
  • The problem that comes with generating my data this way is: I get values that are below 1, which is not possible for the physical parameter the values are representing. Is there a way to give limitations to a KDE estimation? See original post above :) – qwertz Mar 18 '20 at 13:36
  • You can exclude values below 1 with the following (I didn't test this, since I don't know whether `kde.sample` returns a NumPy array): `s = kde.sample(2400); s = s[s >= 1]`. – Peter O. Mar 18 '20 at 13:44
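That filtering idea can be turned into a simple rejection loop that keeps sampling until enough in-range values remain. This is a sketch under the same assumptions as before: `data` is a hypothetical stand-in, and the bandwidth is arbitrary (`kde.sample` does return a NumPy array of shape `(n, 1)`):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# Hypothetical stand-in for data['y']
data = rng.gamma(shape=4.0, scale=0.1, size=209) + 0.88

kde = KernelDensity(kernel='gaussian', bandwidth=0.1).fit(data.reshape(-1, 1))

# Rejection step: oversample, drop out-of-range values, repeat until enough remain
target, lower = 2400, 1.0
samples = np.empty((0, 1))
while samples.shape[0] < target:
    s = kde.sample(target)                     # (target, 1) array of draws
    samples = np.vstack([samples, s[s[:, 0] >= lower]])
samples = samples[:target]
print(samples.shape)  # (2400, 1)
```

This preserves the KDE's shape above the bound, at the cost of discarding some draws.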