
Apologies. I know what I want to do, but am not sure what it is called and so haven't been able to search for it.

I am chasing down some anomalies in data (two reports which should add to the same total based on about 50K readings differ slightly). I therefore want to generate some random data which is the same "shape" as the data in question in order to determine whether this might be down to rounding error.

Is there a way of analysing the existing 50K or so numbers and then generating random numbers which would look pretty much the same shape on a histogram? My presumption is that numpy is probably the best tool for this, but I am open to advice.

TimGJ
  • By "shape" do you mean num rows and num columns (as in `my_array.shape`), or do you mean to fit it to a data distribution? – G. Anderson Jun 23 '20 at 16:10
  • You want to generate random data where the values have the same approximate distribution as the original? – wwii Jun 23 '20 at 16:11
  • Here is another possible: [Python: Generate random values from empirical distribution](https://stackoverflow.com/questions/35434363/python-generate-random-values-from-empirical-distribution) – wwii Jun 23 '20 at 16:22
  • As an aside: There might be other methods to evaluate the difference you are seeing. [Measurement Error Due To Rounding](https://variation.com/measurement-error-due-to-rounding/) – wwii Jun 23 '20 at 16:46
  • @PeterO. - that KDE method looks like it loses some fidelity at the edges of the distribution but it is probably sufficient. The `.rv_histogram` / `.rvs` method below works pretty well. – wwii Jun 23 '20 at 17:20

2 Answers


You can do this with SciPy's `stats` package, if I'm interpreting your question correctly.

First, we compute a histogram of the data and wrap it in a distribution object using `scipy.stats.rv_histogram()`:

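    import scipy.stats
    import numpy as np
    import matplotlib.pyplot as plt

    # Stand-in for the real readings: 50,000 draws from a standard normal
    data = scipy.stats.norm.rvs(size=50000, loc=0)
    # np.histogram returns (counts, bin_edges), which rv_histogram accepts directly
    hist = np.histogram(data, bins=100)
    dist = scipy.stats.rv_histogram(hist)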

To generate new data from this histogram, we simply call the rvs() method on the dist variable:

    fake_data = dist.rvs(size=50000)

Then, we show the two distributions to prove we are getting what we expect:

    plt.figure()
    plt.hist(data, bins=100, alpha=0.5, label='real data')
    plt.hist(fake_data, bins=100, alpha=0.5, label='fake data')
    plt.legend(loc='upper right')
    plt.show()

[Image: overlaid histograms of the real data and the fake data]
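If you want a numeric check alongside the visual one, a two-sample Kolmogorov-Smirnov test is one option. The following is only a sketch: the normal draws stand in for your real readings, and the integer seeds are assumptions added purely so the run is repeatable.

```python
import numpy as np
import scipy.stats

# Stand-in for the real readings (assumption: replace with your own array)
data = scipy.stats.norm.rvs(size=50000, loc=0, random_state=0)

# Same approach as above: histogram -> distribution -> new samples
dist = scipy.stats.rv_histogram(np.histogram(data, bins=100))
fake_data = dist.rvs(size=50000, random_state=1)

# The KS statistic is the largest gap between the two empirical CDFs,
# so a small value means the two samples have a similar shape.
result = scipy.stats.ks_2samp(data, fake_data)
print(f"KS statistic: {result.statistic:.4f}, p-value: {result.pvalue:.4f}")
```

A small statistic suggests the synthetic sample follows the original closely. Within each bin the histogram treats values as evenly spread, so very coarse binning may make the match worse.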

Hopefully this is what you're looking to do.
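To connect this back to the rounding question, one rough experiment is to compare the total of the raw synthetic readings with the total of the same readings rounded one at a time. This is a sketch under assumptions: `decimals = 2` is a hypothetical precision, and the normal draws are again a stand-in for the real data.

```python
import numpy as np
import scipy.stats

# Stand-in for the real readings (assumption: replace with your own array)
data = scipy.stats.norm.rvs(size=50000, loc=0, random_state=0)
dist = scipy.stats.rv_histogram(np.histogram(data, bins=100))
fake_data = dist.rvs(size=50000, random_state=1)

decimals = 2  # hypothetical precision at which one report rounds each reading

total_exact = fake_data.sum()                        # sum at full precision
total_rounded = np.round(fake_data, decimals).sum()  # round each reading, then sum

diff = total_rounded - total_exact
# Each reading moves by at most half a unit in the last kept decimal place,
# so this bounds the difference between the two totals.
worst_case = fake_data.size * 0.5 * 10.0 ** -decimals
print(f"difference: {diff:.6f} (worst-case bound: {worst_case:.2f})")
```

Individual rounding errors tend to partly cancel, so the observed difference is usually far below the worst-case bound. If the gap between your two reports is much larger than what this experiment produces, rounding alone is probably not the explanation.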

Nic Thibodeaux

The magic words are "inverse transform sampling": build the CDF from your histogram, then map uniform random numbers through its inverse. See this nice tutorial: https://usmanwardag.github.io/python/astronomy/2016/07/10/inverse-transform-sampling-with-python.html
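For illustration, a minimal NumPy-only sketch of inverse transform sampling from a histogram might look like the following. It is not taken from the linked tutorial; the stand-in data, the seed and the variable names are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, assumed only for repeatability

# Stand-in for the real readings (assumption: replace with your own array)
data = rng.normal(size=50000)

counts, edges = np.histogram(data, bins=100)

# Empirical CDF at the bin edges: 0 at the first edge, 1 at the last
cdf = np.concatenate(([0.0], np.cumsum(counts) / counts.sum()))

# Draw u ~ Uniform[0, 1) and find the bin whose CDF interval contains it.
# side="right" skips over empty bins, so the chosen bin has nonzero mass.
u = rng.uniform(size=50000)
idx = np.searchsorted(cdf, u, side="right") - 1

# Invert the piecewise-linear CDF inside the chosen bin
frac = (u - cdf[idx]) / (cdf[idx + 1] - cdf[idx])
samples = edges[idx] + frac * (edges[idx + 1] - edges[idx])
```

Plotting `samples` against `data` with `plt.hist` should show two histograms of very similar shape.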

Igor Rivin
  • That is not an answer. – wwii Jun 23 '20 at 16:25
  • @wwii who died and appointed you the judge of what is an answer? – Igor Rivin Jun 23 '20 at 17:02
  • Just my opinion. Sorry if it offended. A lot of *things* here on SO seem to be concrete but if you spend time looking for bases you'll find a lot of differing opinions. – wwii Jun 23 '20 at 17:23
  • @wwii no offense taken (and hopefully not too much given). A lot of people seem to think that an answer is "working code", but, especially for people looking for help with homework, this seems to be not the most useful approach. – Igor Rivin Jun 23 '20 at 18:05