Altering distribution of one dataset to match another dataset

Question

I have 2 datasets, one of modeled (artificial) data and another with observed data. They have slightly different statistical distributions and I want to force the modeled data to match the observed data distribution in the spread of the data. In other words, I need the modeled data to better represent the tails of the observed data. Here's an example.

model <- c(37.50,46.79,48.30,46.04,43.40,39.25,38.49,49.51,40.38,36.98,40.00,
38.49,37.74,47.92,44.53,44.91,44.91,40.00,41.51,47.92,36.98,43.40,
42.26,41.89,38.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
49.81,38.87,40.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
45.34,43.37,54.15,42.77,42.88,44.26,27.14,39.31,24.80,16.62,30.30,
36.39,28.60,28.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
38.65,34.54,37.70,38.11,43.05,29.95,32.48,24.63,35.33,41.34)

observed <- c(39.50,44.79,58.28,56.04,53.40,59.25,48.49,54.51,35.38,39.98,28.00,
28.49,27.74,51.92,42.53,44.91,44.91,40.00,41.51,47.92,36.98,53.40,
42.26,42.89,43.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
52.81,36.87,47.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
51.34,43.37,51.15,42.77,42.88,44.26,27.14,39.31,24.80,12.62,30.30,
34.39,25.60,38.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
34.65,39.54,47.70,38.11,43.05,29.95,22.48,24.63,35.33,41.34)

summary(model)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
16.62   36.98   40.38   40.28   44.91   54.15 

summary(observed)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
12.62   35.54   42.58   41.10   47.76   59.2

How can I force the model data to have the variability that the observed has in R?

This seems more like a mathematical discussion then a programming one. Perhaps it would do well on [Cross Validated](http://stats.stackexchange.com)? (Also, "higher frequency" has no context with the current data. You may need to alter your sample to correctly reflect the time-series nature.) — r2evans, Aug 25 '16 at 18:20
Additionally, you should include the distribution type and parameters of the `model` distribution, unless it is just another empirical distribution, in which case you need to be clear that you have two empirical distributions. — r2evans, Aug 25 '16 at 18:22
@r2evans good point, I got rid of the extra question. I'm unsure of what the distribution type and parameters are for the `model` distribution, but it is modeled data — j_simskii, Aug 25 '16 at 18:33
If it is modeled data, there is nothing that can be done to adjust the variance without discarding the fact that it is a "real model". That is, if you treat it as empirical data and try [distribution fitting](https://en.wikipedia.org/wiki/Distribution_fitting), then you'll be creating a new "model" with whatever variance modification you need. — r2evans, Aug 25 '16 at 18:40

eipi10 · Accepted Answer · 2016-08-25T19:42:31.647

Are you just modeling the distribution of observed? If so, you could generate a kernel density estimate from the observations and then resample from that modeled density distribution. For example:

library(ggplot2)

First we generate a density estimate from the observed values. This is our model of the distribution of the observed values. adjust is a parameter that determines the bandwidth. The default value is 1. Smaller values result in less smoothing (i.e., a density estimate that more closely follows small-scale structure in the data):

dens.obs = density(observed, adjust=0.8)

Now, resample from the density estimate to get the modeled values. We set prob=dens.obs$y so that the probability of a value in dens.obs$x being chosen is proportional to its modeled density.

set.seed(439)
resample.obs = sample(dens.obs$x, 1000, replace=TRUE, prob=dens.obs$y)

Put observed and modeled values in a data frame in preparation for plotting:

dat = data.frame(value=c(observed,resample.obs), 
                 group=rep(c("Observed","Modeled"), c(length(observed),length(resample.obs))))

The ECDF (empirical cumulative distribution function) plot below shows that sampling from the kernel density estimate gives samples with a distribution similar to the observed data:

ggplot(dat, aes(value, fill=group, colour=group)) +
  stat_ecdf(geom="step") +
  theme_bw()

You can also plot the density distribution of the observed data and the values sampled from the modeled distribution (using the same value for the adjust parameter as we used above).

ggplot(dat, aes(value, fill=group, colour=group)) +
  geom_density(alpha=0.4, adjust=0.8) +
  theme_bw()

score 1 · Answer 2 · edited May 23 '17 at 10:28

1

Have a look at this answer How to generate distributions given, mean, SD, skew and kurtosis in R?.

It discusses use of the SuppDists package. This package permits you to create a distribution by creating a set of parameters based on the Johnson system of distributions.

edited May 23 '17 at 10:28

Community

1
1

answered Aug 25 '16 at 20:18

Edward Carney

1,372
9
7

Altering distribution of one dataset to match another dataset

2 Answers2

Linked