0

I would like to generate pandas dataframes with simulated data.

There should be x sets of columns.
Each set corresponds to y number of columns.
Each set should have a value, a, in z number of rows. The value, a, is a float.
However, z may be different for the different sets of column sets.
The remaining columns will have another value, b, which is also a float.

I would like to write a function to generate such pandas data frames where I can specify the variables x, y, a, b and where a specific value for z can be set for the individual column sets.

Here is an example df:

data = [[0.5, 0.5, 0.1, 0.1, 0.1, 0.1], [0.1, 0.1, 0.5, 0.5, 0.1, 0.1], [0.1, 0.1, 0.1, 0.1, 0.5, 0.5]]
df = pd.DataFrame(data, columns=['set1_col1', 'set1_col2', 'set2_col1', 'set2_col2', 'set3_col1', 'set3_col2'])
df

enter image description here

But I would like to be able to specify the variables, which for the above example would be:

x = 3  #(set1, set2, set3)  
y = 2 #(col1, col2 for each set)  
a = 0.5  
z = 1 #(for all column sets)
b = 0.1 

Advice on this would be greatly appreciated!

Thanks!

peter_parker
  • 133
  • 7
  • Thanks for asking the question. I would love to help, but I cannot fully understand the description of the data you want. Perhaps either 1. A formal definition would help. Or 2. A small example table. – Kolyan1 Nov 07 '22 at 09:18
  • Also, perhaps what you are looking for can be easily done with numpy (e.g np.ones([size]) * a ) and then loading into pandas. Like so: https://stackoverflow.com/a/51308247/4386933 – Kolyan1 Nov 07 '22 at 09:20
  • Thank you so much for looking into it! I updated the question with an example. Let me know if it is more clear now or not. – peter_parker Nov 07 '22 at 09:47

1 Answers1

0

Use numpy.random.choice:

N = 5 #No of rows
x = 3  #(set1, set2, set3)  
y = 2 #(col1, col2 for each set)  
a = 0.5  
z = 1 #(for all column sets)
b = 0.1 

#names of sets
sets = [f'set{w+1}' for w in range(x)]
#names of columns
cols = [f'col{w+1}' for w in range(y)]
#MultiIndex by product
mux = pd.MultiIndex.from_product([sets, cols])
#DataFrame with default value
df = pd.DataFrame(b, index=range(N), columns=mux)

#random assign a by random index with no repeat
for c, i in zip(df.columns.levels[0], np.random.choice(df.index, z * x, replace=False)):
    df.loc[i, c] = a

df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
print (df)
   set1_col1  set1_col2  set2_col1  set2_col2  set3_col1  set3_col2
0        0.1        0.1        0.1        0.1        0.1        0.1
1        0.1        0.1        0.5        0.5        0.1        0.1
2        0.5        0.5        0.1        0.1        0.1        0.1
3        0.1        0.1        0.1        0.1        0.1        0.1
4        0.1        0.1        0.1        0.1        0.5        0.5

EDIT: For consecutive random values use:

N = 6 #No of rows
x = 3  #(set1, set2, set3)  
y = 2 #(col1, col2 for each set)  
a = 0.5  
z = 2 #(for all column sets)
b = 0.1 

#names of sets
sets = [f'set{w+1}' for w in range(x)]
#names of columns
cols = [f'col{w+1}' for w in range(y)]
#MultiIndex by product
mux = pd.MultiIndex.from_product([sets, cols])
#DataFrame with default value, index is create by consecutive groups
df = pd.DataFrame(b, index=np.arange(N) // z, columns=mux)
print (df)


#random assign a by random index with no repeat
for c, i in zip(df.columns.levels[0], 
                np.random.choice(np.unique(df.index), x, replace=False)):
    df.loc[i, c] = a

df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')

df = df.reset_index(drop=True)
print (df)
   set1_col1  set1_col2  set2_col1  set2_col2  set3_col1  set3_col2
0        0.5        0.5        0.1        0.1        0.1        0.1
1        0.5        0.5        0.1        0.1        0.1        0.1
2        0.1        0.1        0.1        0.1        0.5        0.5
3        0.1        0.1        0.1        0.1        0.5        0.5
4        0.1        0.1        0.5        0.5        0.1        0.1
5        0.1        0.1        0.5        0.5        0.1        0.1
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252