
In a pandas DataFrame I have subgroups with differing, large numbers of rows. I want to reduce the number of rows for a preliminary analysis while ensuring that the data remain representative across the whole range.

I ran a simulation with 2 factors, or parameters ('A', 'B'), and 2 levels, or values, per factor ('A1', 'A2', 'B1', 'B2'). Each simulation corresponds to one combination of the values of 'A' and 'B'. A simulation stops once its counter exceeds a defined number (35 in the example below). The counter and its increment differ between simulations, and at each step a value 'eval' is summarized from the simulation.

The example below shows a sample of the simulation's results. The real simulation runs much longer (say, for this example, until the counter exceeds 10,000), and it takes hours to graph the evolution of the eval values in my preliminary analysis.

This code generates a sample of the results of the simulation:

import pandas as pd
import numpy as np

columns = ['FactorA', 'FactorB', 'step']
data = [['A1', 'B1', 8], ['A1', 'B1', 13], ['A1', 'B1', 18], ['A1', 'B1', 23], ['A1', 'B1', 28], ['A1', 'B1', 33], ['A1', 'B1', 38],
        ['A1', 'B2', 7], ['A1', 'B2', 13], ['A1', 'B2', 19], ['A1', 'B2', 25], ['A1', 'B2', 31], ['A1', 'B2', 37],
        ['A2', 'B1', 6], ['A2', 'B1', 14], ['A2', 'B1', 22], ['A2', 'B1', 30], ['A2', 'B1', 38],
        ['A2', 'B2', 10], ['A2', 'B2', 12], ['A2', 'B2', 14], ['A2', 'B2', 16], ['A2', 'B2', 18], ['A2', 'B2', 20], ['A2', 'B2', 22],
        ['A2', 'B2', 24], ['A2', 'B2', 26], ['A2', 'B2', 28], ['A2', 'B2', 30], ['A2', 'B2', 32], ['A2', 'B2', 34], ['A2', 'B2', 36]]
df = pd.DataFrame(data, columns=columns)
df['eval'] = np.random.randint(1, 6, df.shape[0])  # one random 'eval' value per step

I tried this; while it reduces the data points, it doesn't balance the number of data points per simulation:

df_reduced = df.iloc[::2]

Also tried:

df_reduced = df.sample(n=int(len(df)/6))

but it also doesn't balance the number of data points per simulation.

I want a DataFrame in which each subgroup has the same number of rows. To ensure that the selection or sampling is balanced, I want the slicing for each subgroup (with .iloc) to use a step that selects 'n' members per subgroup. It would be great, but not necessary, to include the first and last row of each subgroup.
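To illustrate, for a single subgroup something like this gives the kind of spacing I'm after (a rough sketch with a hypothetical n=4, including both endpoints), but I don't know how to apply it per subgroup:

# Rough sketch for one subgroup: pick 4 evenly spaced positions, keeping both ends.
sub = df[(df['FactorA'] == 'A1') & (df['FactorB'] == 'B1')]
idx = np.linspace(0, len(sub) - 1, num=4).round().astype(int)
sub_reduced = sub.iloc[idx]  # rows at steps 8, 18, 28, 38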

EloyRD

2 Answers


Look at the following instruction in your post:

df.sample(n=int(len(df)))

What is odd here:

  • int is not needed (len already returns an int).
  • len(df) selects all rows from df, so a "sample" created this way contains the full original df, only with the row order shuffled, as the quick check below illustrates. Is this what you want?
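
For instance (a quick check; shuffled is a hypothetical name):

shuffled = df.sample(n=len(df))
print(len(shuffled) == len(df))          # True: same number of rows
print(shuffled.sort_index().equals(df))  # True: same rows, only reordered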

And as far as group balancing is concerned:

Decide how you want to keep the balance:

  • Option 1: equal number of sample rows from each group.
  • Option 2: equal fraction of samples from each group.

When you make up your mind:

  • run groupby on your source DataFrame with the group criterion,
  • applying a function that returns the respective sample from each group.

Example: If you want to select from df a sample of 2 rows from each group (option 1), run:

df.groupby(['FactorA', 'FactorB']).apply(lambda grp: grp.sample(n=2))

If you want to return to the original (single-level) index, add:

.reset_index(level=[0, 1], drop=True)

to the above instruction.

If you prefer option 2 (a fraction), change n=... to frac=....
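
Putting it together, a sketch of option 1 with n=2 and the index flattened back to a single level (balanced is a hypothetical name):

balanced = (df.groupby(['FactorA', 'FactorB'])
              .apply(lambda grp: grp.sample(n=2))
              .reset_index(level=[0, 1], drop=True))
print(balanced.groupby(['FactorA', 'FactorB']).size())  # 2 rows per group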

Valdi_Bo
  • I just edited my post to show how I used `df.sample` to reduce my data rows. Thanks for pointing to the `groupby` function. As I want to get equal data points from each simulation, I now have a way to do it. Just a quick question after reading some documentation: if I have two subgroups or simulations with the same quantity and sequence of steps, is there a way to ensure that the same steps are selected? That was my intention in using `iloc` – EloyRD Jun 23 '19 at 18:45
  • If you want to select rows with the same steps from **2** subgroups, I'm afraid there is no *Pandas* function to make such selection. Maybe you should write your own function to do it (find common set of steps and select rows with just these steps). In case of **two** groups to choose from, I can figure out how to do it, but I'm not sure how to do it if there are more groups (and you have actually **four** of them). – Valdi_Bo Jun 23 '19 at 19:12
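
A minimal sketch of what the last comment describes (hypothetical names; for the four demo groups the intersection happens to be empty, since no step value occurs in all of them):

# Find the set of steps present in every subgroup, then keep only those rows.
step_sets = (set(grp['step']) for _, grp in df.groupby(['FactorA', 'FactorB']))
common_steps = set.intersection(*step_sets)
df_common = df[df['step'].isin(common_steps)]  # rows whose step occurs in every subgroup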

After the answer from @Valdi_Bo, I got to the page on Group By: split-apply-combine and combined the ideas from here, here, here, here, and here into a non-elegant but working solution.

For the DataFrame in the example, we can count the rows per group:

grouped = df.groupby(['FactorA','FactorB'])
grouped.size()

This produces:

FactorA  FactorB
A1       B1          7
         B2          6
A2       B1          5
         B2         14
dtype: int64

To reduce each group's rows to a number above, but close to, three, with equal spacing between the values in the step column for each subgroup, and forcing the biggest step to be included, I use:

def filter_group(dfg, col, qty):
    # Range and row count of the current subgroup.
    col_min = dfg[col].min()
    col_max = dfg[col].max()
    col_length = dfg[col].size
    # Spacing between consecutive values of `col` (assumed roughly constant).
    jumps = col_length - 1
    jump_size = int((col_max - col_min) / jumps)
    # Stretch the jump so that roughly `qty` values remain.
    new_jump_size = jumps / qty
    if new_jump_size > 1:
        new_jump_size = int(new_jump_size) * jump_size
    else:
        new_jump_size = jump_size

    # Evenly spaced values of `col`, always keeping the largest one.
    col_select = list(range(col_min, col_max, new_jump_size))
    col_select.append(col_max)

    return dfg[dfg[col].isin(col_select)]

grouped = df.groupby(['FactorA', 'FactorB'], group_keys=False).apply(lambda x: filter_group(x, 'step', 3))

We can verify the number of rows for the demo DataFrame:

grouped = grouped.groupby(['FactorA','FactorB'])
grouped.size()

This produces:

FactorA  FactorB
A1       B1         4
         B2         6
A2       B1         5
         B2         5
dtype: int64

If I have time, I'll modify this to use the sample function with weights related to the range of the step column, so that I get half of the data points from the first third of the range, and I'll post the result here.
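
A rough sketch of that idea (hypothetical helper; the 3:1 weights below favor early steps but would need tuning to yield exactly half from the first third):

def weighted_sample(grp, n=6):
    # Weight rows in the first third of the step range 3x more heavily.
    cutoff = grp['step'].min() + (grp['step'].max() - grp['step'].min()) / 3
    weights = np.where(grp['step'] <= cutoff, 3.0, 1.0)
    return grp.sample(n=min(n, len(grp)), weights=weights)

df_weighted = df.groupby(['FactorA', 'FactorB'], group_keys=False).apply(
    lambda grp: weighted_sample(grp))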

EloyRD