32

Note: this question is not the same as the one answered here: "Pandas: sample each group after groupby"

Trying to figure out how to use pandas.DataFrame.sample or any other function to balance this data:

df['class'].value_counts()

c1    9170
c2    5266
c3    4523
c4    2193
c5    1956
c6    1896
c7    1580
c8    1407
c9    1324

I need to get a random sample from each class (c1, c2, ... c9) where the sample size equals the size of the class with the minimum number of instances. In this example the sample size should be the size of class c9, i.e. 1324.

Any simple way to do this with Pandas?

Update

To clarify my question, in the table above:

c1    9170
c2    5266
c3    4523
...

The numbers are counts of instances of classes c1, c2, c3, ..., so the actual data looks like this:

c1 'foo'
c2 'bar'
c1 'foo-2'
c1 'foo-145'
c1 'xxx-07'
c2 'zzz'
...

etc.

Update 2

To clarify more:

d = {'class':['c1','c2','c1','c1','c2','c1','c1','c2','c3','c3'],
     'val': [1,2,1,1,2,1,1,2,3,3]
    }

df = pd.DataFrame(d)

    class   val
0   c1  1
1   c2  2
2   c1  1
3   c1  1
4   c2  2
5   c1  1
6   c1  1
7   c2  2
8   c3  3
9   c3  3

df['class'].value_counts()

c1    5
c2    3
c3    2
Name: class, dtype: int64

g = df.groupby('class')
g.apply(lambda x: x.sample(g.size().min()))

        class   val
class           
c1  6   c1  1
    5   c1  1
c2  4   c2  2  
    1   c2  2
c3  9   c3  3
    8   c3  3

Looks like this works. Main questions:

How does g.apply(lambda x: x.sample(g.size().min())) work? I know what `lambda` is, but:

  • What is passed to the lambda as x in this case?
  • What is g in g.size()?
  • Why does the output contain the numbers 6, 5, 4, 1, 8, 9? What do they mean?
dokondr

5 Answers

37
g = df.groupby('class')
g.apply(lambda x: x.sample(g.size().min()).reset_index(drop=True))

  class  val
0    c1    1
1    c1    1
2    c2    2
3    c2    2
4    c3    3
5    c3    3

Answers to your follow-up questions

  1. The x in the lambda ends up being a dataframe that is the subset of df represented by the group. Each of these dataframes, one for each group, gets passed through this lambda.
  2. g is the groupby object. I put it in a named variable because I planned on using it twice. df.groupby('class').size() is an alternative way to do df['class'].value_counts(), but since I was going to group by anyway, I might as well reuse the same groupby and use size() to get the value counts... saves time.
  3. Those numbers are the index values from df that go with the sampled rows. I added reset_index(drop=True) to get rid of them.
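
Side note: on pandas 1.1 and later there is also a built-in DataFrameGroupBy.sample, so the same balanced sample can be drawn without apply. A minimal sketch, reusing the example df from the question (n_min and balanced are just illustrative names):

import pandas as pd

d = {'class': ['c1','c2','c1','c1','c2','c1','c1','c2','c3','c3'],
     'val': [1,2,1,1,2,1,1,2,3,3]}
df = pd.DataFrame(d)

# draw the same number of rows (the smallest class size) from every group; requires pandas >= 1.1
n_min = df['class'].value_counts().min()
balanced = df.groupby('class').sample(n=n_min, random_state=0).reset_index(drop=True)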
piRSquared
  • Should `reset_index` go _outside_ the lambda for the posted result (i.e. `apply(...).reset_index(...)`)? I get a `MultiIndex` with the code as posted. – ntjess Sep 01 '21 at 14:22
17

The above answer is correct, but I would like to point out that g above is not a Pandas DataFrame object, which is what the user most likely wants; it is a pandas.core.groupby.groupby.DataFrameGroupBy object. Pandas apply does not modify the dataframe in place but returns a new dataframe. To see this, try calling head on g and the result will be as shown below.

import pandas as pd
d = {'class':['c1','c2','c1','c1','c2','c1','c1','c2','c3','c3'],
     'val': [1,2,1,1,2,1,1,2,3,3]
    }

d = pd.DataFrame(d)
g = d.groupby('class')
g.apply(lambda x: x.sample(g.size().min()).reset_index(drop=True))
g.head()
>>> class val
0    c1    1
1    c2    2
2    c1    1
3    c1    1
4    c2    2
5    c1    1
6    c1    1
7    c2    2
8    c3    3
9    c3    3

To fix this, you can either create a new variable or assign g to the result of the apply, as shown below, so that you get a Pandas DataFrame:

g = d.groupby('class')
g = pd.DataFrame(g.apply(lambda x: x.sample(g.size().min()).reset_index(drop=True)))

Calling the head now yields:

g.head()

>>> class val
0   c1   1
1   c2   2
2   c1   1
3   c1   1
4   c2   2

Which is most likely what the user wants.

Samuel Nde
  • The result of `g.apply` is a `DataFrame`, no need for this conversion. Check for yourself: `type(g.apply(lambda x: x.sample(g.size().min()).reset_index(drop=True)))`. The reason you see this behavior in your example is because `apply` does not modify the object in place -- it _returns_ a `DataFrame`. – ntjess Sep 01 '21 at 14:19
  • This is wrong, as @ntjess pointed out, there is no need for the conversion. Don't spread misinformation. It's worrying that this answer was upvoted. The result of `g.apply(...)` and `g` are two different objects. The first is a `DataFrame` which results from aggregating the `DataFrameGroupBy` object that `g` refers to. – Rodalm Oct 29 '21 at 00:12
  • This also has the potential to be unnecessarily computationally complex if the number of classes is large. It is advisable to store the min and reuse that value during sampling, i.e. `min_num_samps = g.size().min(); d = g.apply(lambda x: x.sample(min_num_samps).reset_index(drop=True))` – ntjess Nov 02 '21 at 01:50
13

This method randomly gets k elements from each class; groups with fewer than k rows are returned whole.

def sampling_k_elements(group, k=3):
    if len(group) < k:
        return group
    return group.sample(k)

balanced = df.groupby('class').apply(sampling_k_elements).reset_index(drop=True)
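
The call above uses the function's default k=3. To sample down to the smallest class instead, one option is to compute k first and pass it through a lambda; a minimal sketch reusing the same df and column name (k and balanced are just illustrative names):

k = df['class'].value_counts().min()
balanced = (df.groupby('class')
              .apply(lambda g: sampling_k_elements(g, k=k))
              .reset_index(drop=True))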
1

"The following code works for undersampling of unbalanced classes but it's too much sorry for that.Try it! And also it works the same for upsampling problems! Good Luck!"

Import required sampling libraries

from sklearn.utils import resample

Define the majority and minority classes

df_minority9 = df[df['class']=='c9']
df_majority1 = df[df['class']=='c1']
df_majority2 = df[df['class']=='c2']
df_majority3 = df[df['class']=='c3']
df_majority4 = df[df['class']=='c4']
df_majority5 = df[df['class']=='c5']
df_majority6 = df[df['class']=='c6']
df_majority7 = df[df['class']=='c7']
df_majority8 = df[df['class']=='c8']

Undersample the majority classes

maj_class1 = resample(df_majority1,
                      replace=True,
                      n_samples=1324,
                      random_state=123)
maj_class2 = resample(df_majority2,
                      replace=True,
                      n_samples=1324,
                      random_state=123)
maj_class3 = resample(df_majority3,
                      replace=True,
                      n_samples=1324,
                      random_state=123)
maj_class4 = resample(df_majority4,
                      replace=True,
                      n_samples=1324,
                      random_state=123)
maj_class5 = resample(df_majority5,
                      replace=True,
                      n_samples=1324,
                      random_state=123)
maj_class6 = resample(df_majority6,
                      replace=True,
                      n_samples=1324,
                      random_state=123)
maj_class7 = resample(df_majority7,
                      replace=True,
                      n_samples=1324,
                      random_state=123)
maj_class8 = resample(df_majority8,
                      replace=True,
                      n_samples=1324,
                      random_state=123)

Combine the minority class with the undersampled majority classes

df = pd.concat([df_minority9, maj_class1, maj_class2, maj_class3, maj_class4, maj_class5, maj_class6, maj_class7, maj_class8])

Display new balanced class counts

df['class'].value_counts()
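
The per-class blocks above can be condensed into a loop; a minimal sketch under the same assumptions (a 'class' column, minority class c9, target size 1324), with parts and df_balanced as illustrative names:

from sklearn.utils import resample
import pandas as pd

parts = [df[df['class'] == 'c9']]          # keep the minority class as-is
for c in ['c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8']:
    parts.append(resample(df[df['class'] == c],
                          replace=True,    # mirrors the answer above; replace=False avoids duplicate rows when undersampling
                          n_samples=1324,
                          random_state=123))
df_balanced = pd.concat(parts)
df_balanced['class'].value_counts()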
Community
  • This can be automated by a for loop or a lambda function. – Ananay Mital Apr 17 '20 at 20:42
  • This answer is so badly written. I am tempted to downvote. But then, if I downvote, I will lose a point. But, this is really poorly written (especially given the fact that someone has already commented about using a loop) – Prasad Raghavendra Jun 07 '20 at 20:18
1

I know this question is old, but I stumbled across it and wasn't really happy with the solutions here and in other threads, so I made a quick solution using a list comprehension that works for me. Maybe it is useful to someone else:

df_for_training_grouped = df_for_training.groupby("sentiment")
df_for_training_grouped.groups.values()  # optional: just lists the row indices belonging to each group
frames_of_groups = [x.sample(df_for_training_grouped.size().min()) for y, x in df_for_training_grouped]
new_df = pd.concat(frames_of_groups)

The result is a dataframe that contains the same number of entries for each group; the number of entries equals the size of the smallest group.
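
If reproducibility matters, a random_state can be passed through as well, and the minimum group size only needs to be computed once; a small sketch of the same comprehension (min_size is an illustrative name, 42 an arbitrary seed):

min_size = df_for_training_grouped.size().min()
frames_of_groups = [x.sample(min_size, random_state=42) for _, x in df_for_training_grouped]
new_df = pd.concat(frames_of_groups)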