How can I get the "shape" of some data so I can generate similar random numbers in numpy/scipy

Question

Apologies. I know what I want to do, but am not sure what it is called and so haven't been able to search for it.

I am chasing down some anomalies in data (two reports which should add to the same total based on about 50K readings differ slightly). I therefore want to generate some random data which is the same "shape" as the data in question in order to determine whether this might be down to rounding error.

Is there a way of analysing the existing 50K or so numbers and then generating random numbers which would look pretty much the same shape on a histogram? My presumption is that numpy is probably the best tool for this, but I am open to advice.

By "shape" do you mean num rows and num columns (as in `my_array.shape`), or do you mean to fit it to a data distribution? — G. Anderson, Jun 23 '20 at 16:10
You want to generate random data where the values have the same approximate distribution as the original? — wwii, Jun 23 '20 at 16:11
Here is another possible: [Python: Generate random values from empirical distribution](https://stackoverflow.com/questions/35434363/python-generate-random-values-from-empirical-distribution) — wwii, Jun 23 '20 at 16:22
As an aside: There might be other methods to evaluate the difference you are seeing. [Measurement Error Due To Rounding](https://variation.com/measurement-error-due-to-rounding/) — wwii, Jun 23 '20 at 16:46
@PeterO. - that KDE method looks like it loses some fidelity at the edges of the distribution but it is probably sufficient. The `.rv_histogram` / `.rvs` method below works pretty well. — wwii, Jun 23 '20 at 17:20

Nic Thibodeaux · Accepted Answer · 2020-06-23T16:40:59.653

2

You can use scipy's stats package to do this, if I'm interpreting your question correctly:

First, we build a histogram of the data with `np.histogram()` and turn it into a distribution object with `scipy.stats.rv_histogram()`:

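    import scipy.stats
    import numpy as np
    import matplotlib.pyplot as plt

    # Stand-in for the real readings: 50,000 draws from a standard normal
    data = scipy.stats.norm.rvs(size=50000, loc=0)
    # np.histogram returns (counts, bin_edges), which rv_histogram accepts directly
    hist = np.histogram(data, bins=100)
    dist = scipy.stats.rv_histogram(hist)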

To generate new data from this histogram, we simply call the `rvs()` method on `dist`:

    fake_data = dist.rvs(size=50000)

Then, we plot the two histograms on top of each other to check that we are getting what we expect:

    plt.figure()
    plt.hist(data, bins=100, alpha=0.5, label='real data')
    plt.hist(fake_data, bins=100, alpha=0.5, label='fake data')
    plt.legend(loc='upper right')
    plt.show()

Hopefully this is what you're looking to do.
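
Coming back to the rounding question itself, here is a rough sketch of how the resampled data might be used. It compares "round each value, then sum" against "sum, then round" over many synthetic data sets. The gamma-distributed `readings`, the two-decimal rounding, the 200 repetitions and the helper name `rounding_gap` are all assumptions made for illustration; swap in the real readings and whatever rounding the two reports actually apply.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)

# Stand-in for the real readings (assumption: replace with your own ~50K values).
readings = rng.gamma(shape=2.0, scale=10.0, size=50_000)

# Empirical distribution built from a histogram of the readings.
dist = scipy.stats.rv_histogram(np.histogram(readings, bins=100))


def rounding_gap(values, decimals=2):
    """Hypothetical helper: (sum of rounded values) minus (rounded sum)."""
    return np.round(values, decimals).sum() - np.round(values.sum(), decimals)


# Repeat the experiment on many synthetic data sets of the same size.
gaps = np.array([
    rounding_gap(dist.rvs(size=readings.size, random_state=rng))
    for _ in range(200)
])

print("mean gap:      ", gaps.mean())
print("std of gap:    ", gaps.std())
print("largest |gap|: ", np.abs(gaps).max())
```

If the discrepancy between the two reports is of the same order as the spread printed here, rounding is a plausible explanation; if it is much larger, something else is probably going on. As a sanity check, independent rounding errors at two decimals over 50,000 values should give a spread of roughly 0.01 * sqrt(50000 / 12), or about 0.65.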

edited Jun 23 '20 at 16:40

answered Jun 23 '20 at 16:09

Added more detail, totally misinterpreted the question! – Nic Thibodeaux Jun 23 '20 at 16:25
How will this fare if the original data doesn't approximate a normal distribution? – wwii Jun 23 '20 at 16:33
@wwii try it yourself, I just attempted with a log-gamma distribution and got a [similarly overlapping histogram](https://imgur.com/UXort1F) – Nic Thibodeaux Jun 23 '20 at 16:37
I was just wondering if you had a *feel* for it. ... My +1. – wwii Jun 23 '20 at 16:47
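
For anyone who wants to repeat that non-normal check with numbers rather than a picture, a minimal sketch follows. It applies the same `rv_histogram` recipe to skewed data, then compares a few quantiles and the two-sample Kolmogorov-Smirnov statistic. The log-normal stand-in data and the particular quantiles are assumptions chosen for illustration, not the log-gamma case from the comment above.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(1)

# Skewed stand-in data (assumption: a log-normal, not the asker's real readings).
data = rng.lognormal(mean=0.0, sigma=0.5, size=50_000)

# Same recipe as the answer: histogram -> rv_histogram -> rvs.
dist = scipy.stats.rv_histogram(np.histogram(data, bins=100))
fake_data = dist.rvs(size=data.size, random_state=rng)

# Compare a few quantiles of the original and the resampled data.
qs = [0.01, 0.25, 0.50, 0.75, 0.99]
print("real quantiles:", np.quantile(data, qs))
print("fake quantiles:", np.quantile(fake_data, qs))

# KS statistic: the largest vertical gap between the two empirical CDFs.
ks = scipy.stats.ks_2samp(data, fake_data)
print("KS statistic:  ", ks.statistic)
```

Because `rv_histogram` treats the density as flat inside each bin, the resampled values never fall outside the range of the original data, and the number of bins controls how much detail is preserved.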

score 0 · Answer 2 · answered Jun 23 '20 at 16:23

The magic words are "inverse transform sampling" (you can generate the CDF from your histogram distribution). See this nice tutorial: https://usmanwardag.github.io/python/astronomy/2016/07/10/inverse-transform-sampling-with-python.html
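
A minimal sketch of inverse transform sampling from a histogram, using only NumPy, might look like the following. The normal stand-in data, the bin count and the variable names are assumptions for illustration; this is not code from the linked tutorial.

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in data (assumption: replace with the real readings).
data = rng.normal(loc=0.0, scale=1.0, size=50_000)

# 1. Histogram the data and build the empirical CDF at the bin edges.
#    cdf[0] is 0 at the left-most edge and cdf[-1] is 1 at the right-most edge.
counts, edges = np.histogram(data, bins=100)
cdf = np.concatenate(([0.0], np.cumsum(counts) / counts.sum()))

# 2. Draw uniform numbers on [0, 1) and push them through the inverse CDF.
#    np.interp interpolates linearly between edges, so the samples are
#    spread uniformly within each bin.
u = rng.random(50_000)
samples = np.interp(u, cdf, edges)

print("original mean/std: ", data.mean(), data.std())
print("resampled mean/std:", samples.mean(), samples.std())
```

This is essentially the piecewise-linear CDF inversion that a histogram-based distribution implies: each uniform draw picks a bin with probability proportional to its count, then lands at a uniform position inside that bin.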

Igor Rivin

That is not an answer. – wwii Jun 23 '20 at 16:25
@wwii who died and appointed you the judge of what is an answer? – Igor Rivin Jun 23 '20 at 17:02
Just my opinion. Sorry if it offended. A lot of *things* here on SO seem to be concrete but if you spend time looking for bases you'll find a lot of differing opinions. – wwii Jun 23 '20 at 17:23
@wwii no offense taken (and hopefully not too much given). A lot of people seem to think that an answer is "working code", but, especially for people looking for help with homework, this seems to be not the most useful approach. – Igor Rivin Jun 23 '20 at 18:05
