2

I have all the statistical details found in pandas' DataFrame.describe() output, like count, mean, standard deviation, min, max, etc., and I need to generate a dataset from these details. Is there any app or Python code that can do this? I want to generate a random dataset that has these statistics:

Count 263
mean 35.790875
std 24.874763
min 0.000000
25% 16.000000
50% 32.000000
75% 49.000000
max 99.000000

Levi
  • Can you add all of the statistical details from `describe` to your question? – Chris Adams May 14 '20 at 10:09
  • I can't embed images as i am new user, i have listed all the details as text. – Levi May 14 '20 at 10:20
  • Levi, I see you're new to SO. If you feel an answer solved the problem, please mark it as 'accepted' by clicking the green check mark at the left of the answer. This helps keep the focus on older SO questions which still don't have answers. Of course, if you are waiting for other answers that's perfectly fine. – Guillermo Mosse May 20 '20 at 10:01

3 Answers

1

Hi and welcome to the forum! This is an excellent question, I loved it.

I think it's nontrivial in the general case. You can create a dataset that has the correct count, mean, min, max, and percentiles, but getting the standard deviation right is quite a bit trickier.

Here is a way of building a dataset that fulfills your example requirements. It can be adapted to the general case, but expect many border cases. The basic idea is to satisfy each requirement from easiest to hardest, taking care not to invalidate the previous ones as you advance.
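Before the code, here's a small self-contained illustration (my own sketch, not part of the construction below) of the key trick used at the end: moving one value below the mean down by d and one value above the mean up by the same d leaves the mean untouched while increasing the spread.

```python
import statistics

v = [10.0, 20.0, 30.0, 40.0]
m0, s0 = statistics.mean(v), statistics.stdev(v)

# Mean-preserving spread: push a value below the mean further down and a
# value above the mean further up by the same amount d.
d = 5.0
v[0] -= d
v[-1] += d

m1, s1 = statistics.mean(v), statistics.stdev(v)
print(m0 == m1, s1 > s0)  # True True
```

This is why the code can fix the std last, after the mean is already correct.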

import math
from math import floor
import pandas as pd

COUNT = 263
MEAN = 35.790875
STD = 24.874763
MIN = 0
P25 = 16
P50 = 32
P75 = 49
MAX = 99

#Positions of the percentiles
P25_pos = floor(0.25 * COUNT) - 1
P50_pos = floor(0.5 * COUNT) - 1
P75_pos = floor(0.75 * COUNT) - 1
MAX_pos = COUNT -1

#Count requirement
v = [0] * COUNT

#Min requirement
v[0] = MIN

#Max requirement
v[MAX_pos] = MAX

#Good, we already satisfied the easiest 3 requirements. Notice that these are deterministic,
#there is only one way to satisfy them

#This will satisfy the 25th percentile requirement
for i in range(1, P25_pos):
    #We could also interpolate the value from P25 to P50, even adding a bit of randomness.
    v[i] = P25
v[P25_pos] = P25

#Actually pandas does some linear interpolation (https://stackoverflow.com/questions/39581893/pandas-find-percentile-stats-of-a-given-column)
#when calculating percentiles but we can simulate that by letting the next value be also P25
if P25_pos + 1 != P50_pos:
    v[P25_pos + 1] = P25

#We do something extremely similar with the other percentiles
for i in range(P25_pos + 3, P50_pos):
    v[i] = P50

v[P50_pos] = P50
if P50_pos + 1 != P75_pos:
    v[P50_pos + 1] = P50

for i in range(P50_pos + 1, P75_pos):
    v[i] = P50

v[P75_pos] = P75
if P75_pos + 1 != MAX_pos:
    v[P75_pos + 1] = P75

for i in range(P75_pos + 1, MAX_pos):
    v[i] = P75

#This will give us correct 25%, 50%, 75%, min, max, and count values. We are still missing MEAN and std.

#We are getting a mean of 24.84, and we need to increase it a little to get 35.790875. So we manually tweak the numbers between the 75th and 100th percentiles.
#That is, the numbers between positions 197 and 261.
#This would be much harder to do automatically instead of with a hardcoded example.

#This increases the average a bit, but not enough!
for i in range(P75_pos + 1, 215):
    v[i] = MAX


#We solve an equation to get the value v[215] must take for the mean to be what we want.
#The equation comes from the formula for the average: AVG = SUM/COUNT. We simply solve for v[215] in that formula.
new_value = MEAN * COUNT - sum(v) + v[215]

#The new value for v[215] should be between P75 and MAX so we don't invalidate the percentiles.
assert(P75 <= new_value)
assert(new_value <= MAX)

v[215] = new_value


#Now comes the tricky part: we need the correct std. As of now, it is 20.916364, and it should be higher: 24.874763
#For this, as we don't want to change the average, we are going to change values in pairs,
#as we need to compensate each absolute increase with an absolute decrease

for i in range(1, P25_pos - 3):
    #We can move the values between the 0th and 25th percentile between 0 and 16
    v[i] -= 12

    #Between the 25th and 50th percentile, we can move the values between 32 and 49
    v[P25_pos + 1 + i] += 12


#As of now, this got us a std of 24.258115. We need it to be a bit higher: 24.874763

#The trick we did before of imposing a value for getting the correct mean is much harder to do here,
#because the equation is much more complicated

#So we'll just approximate the value instead with a while loop. There are faster ways than this, see: https://en.wikipedia.org/wiki/Root-finding_algorithms
current_std = math.sqrt(sum([(val - MEAN)**2 for val in v])/(COUNT - 1))
while 24.874763 - current_std >= 10e-5:
    for i in range(1, P25_pos - 3):
        #We can move the values between the 0th and 25th percentile between 0 and 16
        v[i] -= 0.00001

        #Between the 25th and 50th percentile, we can move the values between 32 and 49
        v[P25_pos + 1 + i] += 0.00001
    current_std = math.sqrt(sum([(val - MEAN)**2 for val in v])/(COUNT - 1))

#We tweak some further decimal points now
while 24.874763 - current_std >= 10e-9:
    v[1] += 0.0001

    #Between the 25th and 50th percentile, we can move the values between 32 and 49
    v[P25_pos + 2] -= 0.0001
    current_std = math.sqrt(sum([(val - MEAN)**2 for val in v])/(COUNT - 1))


df = pd.DataFrame({'col':v})

#Voila!
df.describe()

Output:

    col
count   263.000000
mean    35.790875
std     24.874763
min     0.000000
25%     16.000000
50%     32.000000
75%     49.000000
max     99.000000
Guillermo Mosse
  • I want to generate the original dataset from the given statistical information – Levi May 14 '20 at 09:55
  • 1
    that's impossible. The statistical information is a reduction from the original dataset – Guillermo Mosse May 14 '20 at 09:55
  • 1
    you can of course simulate a dataset that has those characteristics; is that what you want? – Guillermo Mosse May 14 '20 at 09:56
  • 1
    yeah but how to find those random numbers who has this mean, standard deviation etc? – Levi May 14 '20 at 09:57
  • i know excel can do this but it don't have all the functions – Levi May 14 '20 at 10:00
  • Updated my answer with a correct solution :-). Will add another one later with a different, more robust approach. But this code works. If you believe it answers your question, please don't forget of pressing the checkmark button at the left of the answer. – Guillermo Mosse May 14 '20 at 11:36
  • 1
    Great! This really worked. Thanks. But it would have been better if the numbers were more random – Levi May 14 '20 at 13:46
  • I agree. You can add some small random noise in the non-relevant percentile values, but you need to do it before adjusting for the mean and std. Also it's perfectly fine to reshuffle the dataset. – Guillermo Mosse May 14 '20 at 13:51
  • You can use np.random.normal(mean, std) to add the noise, for example. – Guillermo Mosse May 14 '20 at 13:52
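As the comments note, it is perfectly safe to reshuffle the finished dataset: every statistic that describe() reports is order-independent. A quick sketch (the small stand-in list is my own; the constructed v from the answer behaves the same way):

```python
import random
import pandas as pd

# Stand-in for the constructed dataset; any list of numbers works here.
v = [0, 16, 16, 32, 32, 49, 49, 99]
before = pd.DataFrame({'col': v}).describe()

shuffled = v[:]
random.seed(0)           # seed only for reproducibility
random.shuffle(shuffled)
after = pd.DataFrame({'col': shuffled}).describe()

# describe() is invariant under reordering of the rows
print(before.equals(after))  # True
```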
0

I just thought of another approach that makes the numbers look less artificial. It's significantly slower, so only use it if you don't mind the dataset being small. Here's an example with a dataset of size 40, but you can change the value of the COUNT variable to produce a larger one. This code can also be adapted to other value requirements - just change the constants at the top.

We start the same way as in my previous answer, satisfying all requirements except from MEAN and STD:

from math import floor


lr = 10e-6

COUNT = 40.0
MEAN = 35.790875
STD = 24.874763
MIN = 0.0
P25 = 16.0
P50 = 32.0
P75 = 49.0
MAX = 99.0


#Positions of the percentiles
P25_pos = floor(0.25 * COUNT) - 1
P50_pos = floor(0.5 * COUNT) - 1
P75_pos = floor(0.75 * COUNT) - 1
MAX_pos = int(COUNT -1)

#Count requirement
X = [0.0] * int(COUNT)

#Min requirement
X[0] = MIN

#Max requirement
X[MAX_pos] = MAX

#Good, we already satisfied the easiest 3 requirements. Notice that these are deterministic,
#there is only one way to satisfy them

#This will satisfy the 25th percentile requirement
for i in range(1, P25_pos):
    #We could also interpolate the value from P25 to P50, even adding a bit of randomness.
    X[i] = 0.0
X[P25_pos] = P25

#Actually pandas does some linear interpolation (https://stackoverflow.com/questions/39581893/pandas-find-percentile-stats-of-a-given-column)
#when calculating percentiles but we can simulate that by letting the next value be also P25
if P25_pos + 1 != P50_pos:
    X[P25_pos + 1] = P25

#We do something extremely similar with the other percentiles
for i in range(P25_pos + 2, P50_pos):
    X[i] = P25

X[P50_pos] = P50
if P50_pos + 1 != P75_pos:
    X[P50_pos + 1] = P50

for i in range(P50_pos + 1, P75_pos):
    X[i] = P50

X[P75_pos] = P75
if P75_pos + 1 != MAX_pos:
    X[P75_pos + 1] = P75

for i in range(P75_pos + 2, MAX_pos):
    X[i] = P75

But then, we treat this as a (constrained) gradient descent problem: we want to minimize the difference between our MEAN and STD and the expected MEAN and STD, while keeping the quartile values fixed. The values we want to learn are the entries of the dataset - excluding the quartiles, of course, because we already have a constraint for what those values must be.

#Sample standard deviation
def std(X):
    m = sum(X) / len(X)
    return (sum([(val - m)**2 for val in X]) / (len(X) - 1)) ** 0.5

#This function measures the difference between our STD and MEAN and the expected values
def cost(X):
    m = sum(X) / len(X)
    return ((sum([(val - m)**2 for val in X])/(len(X) - 1) - STD**2)) ** 2 + (m - MEAN)**4

#You have to install this library
import autograd.numpy as anp  # Thinly-wrapped numpy
from autograd import grad     #for automatically calculating gradients of functions

#This is the derivative of the cost and it is used in the gradient descent to update the values of the dataset
grad_cost = grad(cost)

def learn(lr, epochs):
    for j in range(0, epochs):
        #Compute the full gradient once per epoch; calling grad_cost once per list element is much slower
        gr = [g * lr for g in grad_cost(X)]

        for i in range(1, P25_pos):
            if X[i] - gr[i] >= MIN and X[i] - gr[i] <= P25:
                X[i] -= gr[i]

        for i in range(P25_pos+2, P50_pos):
            if X[i] - gr[i] >= P25 and X[i] - gr[i] <= P50:
                X[i] -= gr[i]

        for i in range(P50_pos + 2, P75_pos):
            if X[i] - gr[i] >= P50 and X[i] - gr[i] <= P75:
                X[i] -= gr[i]

        for i in range(P75_pos + 2, MAX_pos):
            if X[i] - gr[i] >= P75 and X[i] - gr[i] <= MAX:
                X[i] -= gr[i]

        if j % 100 == 0:
            print(cost(X))

        #if j % 200 == 0:
        #    print(gr)

    print(cost(X))
    print(X)

You can now use the learn(learning_rate, epochs) function to run the gradient descent. I used learning rates between 10e-7 and 10e-4.

For this case, after learning for quite some time (around 100K epochs, which took about an hour), I got an STD of 24.871 (compare it to the actual value of 24.874) and a mean of 31.730 (compare it to the actual value of 35.790). These are the results I got:

col
count   40.000000
mean    31.730694
std     24.871651
min     0.000000
25%     16.000000
50%     32.000000
75%     49.000000
max     99.000000

with the following sorted column values:

[0.0, 1.6232547073078982, 1.6232547073078982, 1.6232547073078982, 1.6232547073078982, 1.6232547073078982, 1.6232547073078982, 1.6232547073078982, 1.6232547073078982, 16.0, 16.0, 17.870937400371687, 17.870937400371687, 17.870937400371687, 17.870937400371687, 17.870937400371687, 17.870937400371687, 17.870937400371687, 17.870937400371687, 32.0, 32.0, 38.50321491745568, 38.50321491745568, 38.50321491745568, 38.50321491745568, 38.50321491745568, 38.50321491745568, 38.50321491745568, 38.50321491745568, 49.0, 49.0, 64.03106466400027, 64.03106466400027, 64.03106466400027, 64.03106466400027, 64.03106466400027, 64.03106466400027, 64.03106466400027, 64.03106466400027, 99.0]

These results can definitely be improved with more training. I will update the answer when I get better results.

Guillermo Mosse
0

I had a similar problem, though not as complex. For your information:

import numpy as np

def simulate_data(COUNT, MIN, P25, P50, P75, MAX):
    #Draw COUNT normal samples; their min/quartiles/max get mapped onto the requested values below
    c = np.round(np.random.normal(0.5 * COUNT, 0.25 * COUNT, COUNT), 0)
    y = [MIN,P25,P50,P75,MAX]
    x = [min(c),np.percentile(c,25),np.percentile(c,50),np.percentile(c,75),max(c)]
    y_I = np.interp(c, x, y)
    return y_I
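For completeness, here is a self-contained run of this approach (the function is repeated so the snippet runs standalone; the seed is my addition, for reproducibility). Note that only count, min, and max come out exact: the quartiles land close to the targets thanks to the interpolation, and the mean and std are not controlled at all.

```python
import numpy as np
import pandas as pd

def simulate_data(COUNT, MIN, P25, P50, P75, MAX):
    # Draw COUNT normal samples, then map their empirical min/quartiles/max
    # onto the requested values with piecewise-linear interpolation.
    c = np.round(np.random.normal(0.5 * COUNT, 0.25 * COUNT, COUNT), 0)
    y = [MIN, P25, P50, P75, MAX]
    x = [min(c), np.percentile(c, 25), np.percentile(c, 50),
         np.percentile(c, 75), max(c)]
    return np.interp(c, x, y)

np.random.seed(42)  # seed only for reproducibility
data = simulate_data(263, 0.0, 16.0, 32.0, 49.0, 99.0)
print(pd.Series(data).describe())
```

Because np.interp maps the sample's min and max exactly onto MIN and MAX, those two statistics always match; everything else is approximate.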
莊佩琪