
I have some data made up of a class column (X) and some binary columns (Y). I would like to equalise the class sizes by oversampling the smaller classes. For example, if I start with:

Df_01 = pd.DataFrame({'X' : [1,1,1,1,1,1,1,2,2],
                      'Y1': [1,1,1,1,1,0,0,0,1],
                      'Y2': [0,0,0,0,0,1,0,0,0]})

Then I would like to get:

Df_02 = pd.DataFrame({'X' : [1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2],
                      'Y1': [1,1,1,1,1,0,0,0,1,0,1,0,1,0,1,0],
                      'Y2': [0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0]})

I've attempted to do it:

# Sort the data by class
Ma_01 = Df_01.groupby('X')
Di_01 = {}
for name, group in Ma_01:
    Di_01[str(name)] = group

# Size of each class
Se_01 = Df_01.groupby('X').size()

# Size of the biggest class
In_Bi = max(Se_01)

# How much oversampling would equalise the class sizes?
Se_Ra = In_Bi / Se_01
Di_Ra =  Se_Ra.to_dict()
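For the example data above, these steps produce the following intermediate values (a quick sanity check of the ratios):

```python
import pandas as pd

# Reconstruct the example data and the intermediate steps from above
Df_01 = pd.DataFrame({'X' : [1,1,1,1,1,1,1,2,2],
                      'Y1': [1,1,1,1,1,0,0,0,1],
                      'Y2': [0,0,0,0,0,1,0,0,0]})

Se_01 = Df_01.groupby('X').size()   # rows per class: {1: 7, 2: 2}
In_Bi = max(Se_01)                  # size of the biggest class: 7
Di_Ra = (In_Bi / Se_01).to_dict()   # oversampling ratio per class

print(Di_Ra)  # {1: 1.0, 2: 3.5}
```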

But when I try:

# Copy each dataframe
Di_03 = {}
for x in Di_01:
    for y in range(int(Di_Ra[int(x)])):
        if not Di_03:
            Di_03[x] = Di_01[x]
        else:
            Di_03[x] = Di_03[x].append(Di_01[x])

# Concatenate the dictionary into a single dataframe
df_03 = pd.concat(Di_03.values(), ignore_index=True)

I get

KeyError: '2'
R. Cox

1 Answer


Thanks for finding the duplicate, Matthew Strawbridge! Ayhan's answer to the original works on my data:

max_size = Df_01['X'].value_counts().max()

lst = [Df_01]

for class_index, group in Df_01.groupby('X'):
    # Sample (with replacement) enough extra rows to bring each class up to max_size
    lst.append(group.sample(max_size - len(group), replace=True))

Df_03 = pd.concat(lst)
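Running this on the example `Df_01` above confirms that every class ends up with as many rows as the largest one (a small verification sketch, not part of the original answer):

```python
import pandas as pd

Df_01 = pd.DataFrame({'X' : [1,1,1,1,1,1,1,2,2],
                      'Y1': [1,1,1,1,1,0,0,0,1],
                      'Y2': [0,0,0,0,0,1,0,0,0]})

max_size = Df_01['X'].value_counts().max()

lst = [Df_01]
for class_index, group in Df_01.groupby('X'):
    # Sample with replacement just enough rows to top this class up to max_size
    lst.append(group.sample(max_size - len(group), replace=True))

Df_03 = pd.concat(lst)

print(Df_03['X'].value_counts())  # both classes now have 7 rows
```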
R. Cox
  • As Ayhan mentions in the original, I should "maybe add some noise to it". This is because oversampled data can be very repetitive, and this causes machine learning tools to overfit to the oversampled data. I would use SMOTE to add the noise but I don't think that works for binary. I'll see if I can swap the values within classes. – R. Cox Nov 20 '18 at 13:19
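One way to implement the within-class value swap mentioned in the comment above (my own sketch, not from the original thread; `shuffle_within_class` is a hypothetical helper): independently permute each binary column inside every class group. This preserves each class's per-column frequencies while breaking up the exact duplicate rows created by sampling with replacement.

```python
import numpy as np
import pandas as pd

def shuffle_within_class(df, class_col='X', seed=0):
    """Independently permute every non-class column within each class group.

    Keeps each class's per-column value frequencies, but decouples the
    columns from one another, so duplicated rows stop being identical.
    """
    rng = np.random.default_rng(seed)
    out = df.copy()
    value_cols = [c for c in df.columns if c != class_col]
    for _, idx in df.groupby(class_col).groups.items():
        for col in value_cols:
            out.loc[idx, col] = rng.permutation(df.loc[idx, col].to_numpy())
    return out
```

Because only the order within each class changes, per-class sums (and hence class-conditional frequencies) are unchanged.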