I just thought of another approach that makes the numbers look less artificial. It's significantly slower, so it's only practical for small datasets. Here's an example with a dataset of size 40, but you can change the value of the COUNT variable if you want to produce a larger dataset. The code can also be adapted to other target statistics: just change the constants in the header.
We start the same way as in my previous answer, satisfying all requirements except for MEAN and STD:
from math import floor
lr = 10e-6
COUNT = 40.0
MEAN = 35.790875
STD = 24.874763
MIN = 0.0
P25 = 16.0
P50 = 32.0
P75 = 49.0
MAX = 99.0
# Positions of the percentiles
P25_pos = floor(0.25 * COUNT) - 1
P50_pos = floor(0.5 * COUNT) - 1
P75_pos = floor(0.75 * COUNT) - 1
MAX_pos = int(COUNT - 1)
# Count requirement
X = [0.0] * int(COUNT)
# Min requirement
X[0] = MIN
# Max requirement
X[MAX_pos] = MAX
# Good, we already satisfied the easiest 3 requirements. Notice that these are deterministic:
# there is only one way to satisfy them.
# This will satisfy the 25th percentile requirement
for i in range(1, P25_pos):
    # We could also interpolate the values from MIN to P25, even adding a bit of randomness.
    X[i] = 0.0
X[P25_pos] = P25
# Actually pandas does some linear interpolation (https://stackoverflow.com/questions/39581893/pandas-find-percentile-stats-of-a-given-column)
# when calculating percentiles, but we can simulate that by letting the next value also be P25
if P25_pos + 1 != P50_pos:
    X[P25_pos + 1] = P25
# We do something extremely similar with the other percentiles
for i in range(P25_pos + 2, P50_pos):
    X[i] = P25
X[P50_pos] = P50
if P50_pos + 1 != P75_pos:
    X[P50_pos + 1] = P50
for i in range(P50_pos + 2, P75_pos):
    X[i] = P50
X[P75_pos] = P75
if P75_pos + 1 != MAX_pos:
    X[P75_pos + 1] = P75
for i in range(P75_pos + 2, MAX_pos):
    X[i] = P75
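At this point it's worth sanity-checking the construction against pandas itself, since pandas' interpolation is exactly what we are trying to match. A minimal check, assuming pandas is installed (the column name col is arbitrary):
import pandas as pd

# count, min, 25%, 50%, 75% and max should already match the targets here;
# only mean and std remain to be fixed by the gradient descent below.
print(pd.DataFrame({'col': X}).describe())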
But then we treat this as a (constrained) gradient descent problem: we want to minimize the difference between our dataset's mean and standard deviation and the target MEAN and STD, while keeping the quartile values fixed. The values we want to learn are the entries of our dataset; we exclude the quartiles, of course, because we already have a constraint on what those values must be.
# Sample standard deviation (ddof = 1, matching what pandas' describe() reports)
def std(X):
    m = sum(X) / len(X)
    return (sum([(val - m)**2 for val in X]) / (len(X) - 1)) ** 0.5

# This function measures the difference between our STD and MEAN and the expected values
def cost(X):
    m = sum(X) / len(X)
    # squared error on the variance plus a quartic penalty on the mean
    return (sum([(val - m)**2 for val in X]) / (len(X) - 1) - STD**2) ** 2 + (m - MEAN)**4
# You have to install this library: pip install autograd
import autograd.numpy as anp  # thinly-wrapped numpy
from autograd import grad  # for automatically calculating gradients of functions

# This is the derivative of the cost; gradient descent uses it to update the values of the dataset
grad_cost = grad(cost)
def learn(lr, epochs):
    for j in range(epochs):
        # Compute the full gradient once per epoch (calling grad_cost once per
        # element would recompute it len(X) times and be far slower)
        gr = [g * lr for g in grad_cost(X)]
        # Only update the non-pinned positions, and only if the update keeps
        # each value inside its quartile bracket
        for i in range(1, P25_pos):
            if MIN <= X[i] - gr[i] <= P25:
                X[i] -= gr[i]
        for i in range(P25_pos + 2, P50_pos):
            if P25 <= X[i] - gr[i] <= P50:
                X[i] -= gr[i]
        for i in range(P50_pos + 2, P75_pos):
            if P50 <= X[i] - gr[i] <= P75:
                X[i] -= gr[i]
        for i in range(P75_pos + 2, MAX_pos):
            if P75 <= X[i] - gr[i] <= MAX:
                X[i] -= gr[i]
        if j % 100 == 0:
            print(cost(X))
        #if j % 200 == 0:
        #    print(gr)
    print(cost(X))
    print(X)
You can now use the learn(learning_rate, epochs) function to do the gradient descent. I used learning rates between 10e-7 and 10e-4.
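For example, a run along these lines produces a summary like the one below; 10e-6 and 100000 are just illustrative picks from the ranges mentioned above, not the exact values I used:
import pandas as pd

learn(10e-6, 100000)  # small learning rate, many epochs
print(pd.DataFrame({'col': X}).describe())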
For this case, after learning for quite some time (around 100K epochs, which took about an hour), I got an STD of 24.871 (compare it to the actual value of 24.874) and a mean of 31.730 (compare it to the actual value of 35.790). These are the results I got:
col
count 40.000000
mean 31.730694
std 24.871651
min 0.000000
25% 16.000000
50% 32.000000
75% 49.000000
max 99.000000
with the following sorted column values:
[0.0, 1.6232547073078982, 1.6232547073078982, 1.6232547073078982, 1.6232547073078982, 1.6232547073078982, 1.6232547073078982, 1.6232547073078982, 1.6232547073078982, 16.0, 16.0, 17.870937400371687, 17.870937400371687, 17.870937400371687, 17.870937400371687, 17.870937400371687, 17.870937400371687, 17.870937400371687, 17.870937400371687, 32.0, 32.0, 38.50321491745568, 38.50321491745568, 38.50321491745568, 38.50321491745568, 38.50321491745568, 38.50321491745568, 38.50321491745568, 38.50321491745568, 49.0, 49.0, 64.03106466400027, 64.03106466400027, 64.03106466400027, 64.03106466400027, 64.03106466400027, 64.03106466400027, 64.03106466400027, 64.03106466400027, 99.0]
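The std helper from earlier also gives a quick way to see how far the learned values still are from the targets:
m = sum(X) / len(X)
print(f"mean: {m:.6f} (target {MEAN})")
print(f"std:  {std(X):.6f} (target {STD})")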
These results can definitely be improved with more training. I will update the answer when I get better results.