Generate random numbers in a specific range that correlates to other column values using Pandas

Question

How do I generate random numbers in a specific range that are correlated with the values in another column?

I have a DataFrame with a column, say height, and I need to generate an extra column, diameter, whose values lie within a given range and correlate strongly with height. How can I do this?

What I have done so far is generate random height values within a range that differs for each species. However, I could not find a way to generate diameter values in a new column within a specific range.

I need the diameter to be between 30 and 70 for 'pinus_mugo' and between 50 and 100 for 'pinus_nigra'.

import numpy as np
import pandas as pd

# `points` is an existing DataFrame with a 'species' column
is_mugo = points['species'] == 'pinus_mugo'
points.loc[is_mugo, 'height'] = np.round(
    np.random.uniform(35.0, 59.0, size=is_mugo.sum()), 2)

is_nigra = points['species'] == 'pinus_nigra'
points.loc[is_nigra, 'height'] = np.round(
    np.random.uniform(20.0, 43.0, size=is_nigra.sum()), 2)

I think you're looking for something like the `map` function in Processing (see [this post](https://stackoverflow.com/questions/3451553/value-remapping)). — fsimonjetz, Mar 31 '22 at 12:09
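
The value remapping mentioned in this comment could be sketched in NumPy roughly as follows. This is only an illustration, not the commenter's code: `remap_with_noise` and its `noise_frac` parameter are hypothetical names, and the height range 35 to 59 and diameter range 30 to 70 are the 'pinus_mugo' values from the question. `np.interp` does the linear remap; Gaussian noise is then added and the result is clipped back into the target range.

```python
import numpy as np

def remap_with_noise(values, src_range, dst_range, noise_frac=0.1, rng=None):
    """Linearly remap values from src_range to dst_range, then add noise.

    noise_frac is a hypothetical tuning knob: the noise standard deviation
    expressed as a fraction of the width of dst_range.
    """
    rng = np.random.default_rng() if rng is None else rng
    values = np.asarray(values, dtype=float)
    # Linear remap: src_range[0] -> dst_range[0], src_range[1] -> dst_range[1]
    mapped = np.interp(values, src_range, dst_range)
    width = dst_range[1] - dst_range[0]
    noise = rng.normal(0.0, noise_frac * width, size=values.shape)
    # Clip so the noisy values stay inside the requested range
    return np.clip(mapped + noise, dst_range[0], dst_range[1])

# Example: pinus_mugo heights (35 to 59) mapped to diameters (30 to 70)
rng = np.random.default_rng(0)
heights = rng.uniform(35.0, 59.0, size=500)
diameters = remap_with_noise(heights, (35.0, 59.0), (30.0, 70.0), rng=rng)
```

A smaller `noise_frac` gives a stronger correlation with height; a larger one weakens it and makes more values pile up at the clipped edges of the range.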

score 0 · Answer 1 · answered Apr 01 '22 at 12:07

You can define a function that gives you an output value depending on an input value, and apply this function to a column of your DataFrame:

import random
import pandas as pd

data = {'species': ['pinus_mugo', 'pinus_nigra'],
        'height': [45, 30]}
df = pd.DataFrame(data)

def diameter(height):
  # Random diameter between 2.5% and 3.5% of the height
  return random.uniform(0.025*height, 0.035*height)

df['diameter'] = df['height'].apply(diameter)

This is just an example that assumes the diameter is roughly proportional to the height; you can of course define any other random function.

You can also define a function that creates a random value within a range that depends on the species (rather than the height) and apply this to the species column:

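def diameter2(species):
  # Diameter range per species, as requested in the question
  if species == 'pinus_mugo':
    low, high = 30, 70
  elif species == 'pinus_nigra':
    low, high = 50, 100
  else:
    raise ValueError(f'unknown species: {species}')
  # Float drawn uniformly between low and high
  return random.uniform(low, high)

df['diameter2'] = df['species'].apply(diameter2)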
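
For larger DataFrames, the per-species lookup could also be done without `apply`. The following is a minimal sketch under stated assumptions: the `DIAMETER_RANGES` dictionary and the three-row example frame are made up for illustration, with the bounds taken from the question. It relies on `Generator.uniform` accepting arrays for its lower and upper bounds, so all rows are drawn in one call.

```python
import numpy as np
import pandas as pd

# Diameter bounds from the question: species -> (low, high)
DIAMETER_RANGES = {'pinus_mugo': (30, 70), 'pinus_nigra': (50, 100)}

# Small illustrative frame (hypothetical data)
df = pd.DataFrame({'species': ['pinus_mugo', 'pinus_nigra', 'pinus_mugo'],
                   'height': [45, 30, 52]})

# Split the bounds into two lookup tables and map them onto every row
lows = {name: bounds[0] for name, bounds in DIAMETER_RANGES.items()}
highs = {name: bounds[1] for name, bounds in DIAMETER_RANGES.items()}
low = df['species'].map(lows).to_numpy(dtype=float)
high = df['species'].map(highs).to_numpy(dtype=float)

# One vectorized draw; each row uses its own bounds
rng = np.random.default_rng(0)
df['diameter2'] = rng.uniform(low, high)
```

As with the function above, these values depend only on the species, so they are not correlated with height within a species.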

score 0 · Accepted Answer · answered May 25 '22 at 09:32

This did the job:

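import numpy as np
import pandas as pd
from matplotlib.pyplot import scatter

# Desired (min, max) ranges: xx for height, yy for diameter
xx = np.array([5, 35])
yy = np.array([8, 90])

means = [xx.mean(), yy.mean()]
# Dividing by 3 puts the range limits three standard deviations from the mean,
# so a few values may still fall outside the range
stds = [xx.std() / 3, yy.std() / 3]
corr = 0.90         # correlation
covs = [[stds[0]**2          , stds[0]*stds[1]*corr],
        [stds[0]*stds[1]*corr,           stds[1]**2]]

# Draw 760 correlated pairs; row 0 is height, row 1 is diameter
m = np.random.multivariate_normal(means, covs, 760).T
scatter(m[0], m[1])

dataset = pd.DataFrame({'height': m[0], 'diameter': m[1]})
dataset['index'] = dataset.index
# `reference` is assumed to be the existing DataFrame (not shown here)
reference['index'] = reference.index

result = pd.concat([reference, dataset], axis=1)
scatter(result['height'], result['diameter'])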

Result: [scatter plot of diameter against height; image not preserved]
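
When the heights already exist in the DataFrame, as in the question, one option is to draw only the diameter, conditioned on those heights. The sketch below is an illustration under stated assumptions, not part of the answer above: `correlated_column` is a hypothetical helper, the standard deviation of `(high - low) / 6` mirrors the "divide by 3" choice in the answer, and the final clip is an added step to keep every value inside the range.

```python
import numpy as np

def correlated_column(x, low, high, corr=0.9, rng=None):
    """Draw values correlated with x and centred in [low, high].

    The standard deviation is (high - low) / 6, so the bounds sit three
    standard deviations from the mean; values are clipped afterwards.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()          # standardized input column
    eps = rng.standard_normal(x.shape)    # independent noise
    mean, std = (low + high) / 2, (high - low) / 6
    # Mix the standardized input and the noise so that corr(x, y) is about corr
    y = mean + std * (corr * z + np.sqrt(1 - corr**2) * eps)
    return np.clip(y, low, high)

rng = np.random.default_rng(0)
# Heights generated as in the question (pinus_mugo)
height = np.round(rng.uniform(35.0, 59.0, size=1000), 2)
diameter = correlated_column(height, 30, 70, corr=0.9, rng=rng)
sample_corr = np.corrcoef(height, diameter)[0, 1]  # should land near corr
```

Calling this once per species, with that species' height rows and diameter bounds, gives each species its own range. The sample correlation only approximates `corr`, and clipping shifts it slightly.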