In a pandas DataFrame I have subgroups with different, large numbers of rows. I want to reduce the number of rows for a preliminary analysis while ensuring that the data is still representative over the whole range.
I ran a simulation with 2 factors or parameters ('A', 'B') and 2 levels or values per factor ('A1', 'A2', 'B1', 'B2'). Each simulation corresponds to one combination of values of 'A' and 'B'. The simulation stops after the counter is above a defined number ('35' in the example below).
For each simulation, the counter and its increment are different. At each step, a value 'eval' is summarized from the simulation.
The example below shows a sample of the simulation's results. The actual simulation runs for much longer (say, until the counter is above 10000), and it takes hours to graph the evolution of the 'eval' values in my preliminary analysis.
This code generates a sample of the results of the simulation:
import pandas as pd
import numpy as np
columns = ['FactorA', 'FactorB', 'step']
data = [['A1', 'B1', 8], ['A1', 'B1', 13], ['A1', 'B1', 18], ['A1', 'B1', 23], ['A1', 'B1', 28], ['A1', 'B1', 33], ['A1', 'B1', 38],
['A1', 'B2', 7], ['A1', 'B2', 13],['A1', 'B2', 19],['A1', 'B2', 25],['A1', 'B2', 31],['A1', 'B2', 37],
['A2', 'B1', 6], ['A2', 'B1', 14],['A2', 'B1', 22],['A2', 'B1', 30],['A2', 'B1', 38],
['A2', 'B2', 10], ['A2', 'B2', 12],['A2', 'B2', 14],['A2', 'B2', 16],['A2', 'B2', 18],['A2', 'B2', 20],['A2', 'B2', 22],['A2', 'B2', 24],['A2', 'B2', 26],['A2', 'B2', 28],['A2', 'B2', 30],['A2', 'B2', 32],['A2', 'B2', 34],['A2', 'B2', 36]
]
df = pd.DataFrame(data, columns=columns)
df['eval'] = np.random.randint(1, 6, df.shape[0])
I tried this, but while it reduces the number of data points, it doesn't balance the number of data points per simulation:
df_reduced = df.iloc[::2]
I also tried:
df_reduced = df.sample(n=int(len(df)/6))
but it also doesn't balance the number of data points per simulation.
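To make the imbalance concrete, here is a minimal sketch that counts the rows left per subgroup after the every-other-row slice. It uses a smaller, made-up frame with only two subgroups (7 and 14 rows), not the full sample above; the name `counts` is introduced only for this illustration.

```python
import pandas as pd

# Illustrative frame (assumption): two subgroups of different length.
df = pd.DataFrame({
    "FactorA": ["A1"] * 7 + ["A2"] * 14,
    "FactorB": ["B1"] * 7 + ["B2"] * 14,
    "step": list(range(8, 39, 5)) + list(range(10, 37, 2)),
})

# Keep every other row of the whole frame, ignoring subgroup boundaries.
df_reduced = df.iloc[::2]

# Count the surviving rows per (FactorA, FactorB) subgroup.
counts = df_reduced.groupby(["FactorA", "FactorB"]).size()
print(counts)
```

The longer subgroup keeps roughly twice as many rows as the shorter one, which is the imbalance described above.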
I want a DataFrame in which each subgroup has the same number of rows.
To ensure that the selection or sampling is balanced, I want the slicing of each subgroup with .iloc to use steps that select 'n' members per subgroup.
It would be great, but not necessary, to include the first and last rows of each subgroup.
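The kind of selection I have in mind could be sketched as follows. This is only a rough illustration under assumptions, not a tested solution: `pick_evenly` is a hypothetical helper, the positions come from `np.linspace` rather than a fixed `.iloc` step, and the frame is a cut-down version of the sample with two subgroups.

```python
import numpy as np
import pandas as pd

def pick_evenly(group: pd.DataFrame, n: int) -> pd.DataFrame:
    """Hypothetical helper: n rows at evenly spaced positions, first and last included."""
    if len(group) <= n:
        # Subgroup already has n rows or fewer: keep it unchanged.
        return group
    # Evenly spaced positions from the first row (0) to the last row (len - 1).
    positions = np.linspace(0, len(group) - 1, num=n).round().astype(int)
    return group.iloc[positions]

# Illustrative frame (assumption): two subgroups with 7 and 5 rows.
df = pd.DataFrame({
    "FactorA": ["A1"] * 7 + ["A2"] * 5,
    "FactorB": ["B1"] * 7 + ["B1"] * 5,
    "step": [8, 13, 18, 23, 28, 33, 38, 6, 14, 22, 30, 38],
})

n = 3  # desired number of rows per subgroup
df_reduced = pd.concat(
    [pick_evenly(g, n) for _, g in df.groupby(["FactorA", "FactorB"], sort=False)]
)
print(df_reduced)
```

With `n = 3`, each subgroup would contribute its first row, a middle row, and its last row, so every simulation ends up with the same number of points spread over its whole range.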