Note: this is not the same question as the one answered in "Pandas: sample each group after groupby".
I am trying to figure out how to use pandas.DataFrame.sample or any other function to balance this data:
df['class'].value_counts()
c1 9170
c2 5266
c3 4523
c4 2193
c5 1956
c6 1896
c7 1580
c8 1407
c9 1324
I need a random sample of each class (c1, c2, ..., c9), where the sample size equals the size of the smallest class. In this example the sample size should be the size of class c9, i.e. 1324.
Is there a simple way to do this with pandas?
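To make the goal concrete, here is a minimal sketch of the result I am after. It runs on a made-up toy frame (the name toy and its values are hypothetical, not my real data) and assumes pandas 1.1 or later, where DataFrameGroupBy.sample is available. I am not sure this is the best approach.

```python
import pandas as pd

# Toy data standing in for the real frame (hypothetical values).
toy = pd.DataFrame({
    "class": ["c1"] * 5 + ["c2"] * 3 + ["c3"] * 2,
    "val": range(10),
})

# Size of the smallest class; this becomes the per-class sample size.
n_min = toy["class"].value_counts().min()

# DataFrameGroupBy.sample (pandas >= 1.1) draws n rows from each group.
# random_state is fixed only to make the draw repeatable.
balanced = toy.groupby("class").sample(n=n_min, random_state=0)

print(balanced["class"].value_counts())
```

After this, every class should appear exactly n_min times in balanced.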
Update
To clarify my question, in the table above:
c1 9170
c2 5266
c3 4523
...
The numbers are the counts of instances of classes c1, c2, c3, ..., so the actual data looks like this:
c1 'foo'
c2 'bar'
c1 'foo-2'
c1 'foo-145'
c1 'xxx-07'
c2 'zzz'
...
etc.
Update 2
To clarify more:
import pandas as pd

d = {'class': ['c1', 'c2', 'c1', 'c1', 'c2', 'c1', 'c1', 'c2', 'c3', 'c3'],
     'val': [1, 2, 1, 1, 2, 1, 1, 2, 3, 3]}
df = pd.DataFrame(d)
  class  val
0    c1    1
1    c2    2
2    c1    1
3    c1    1
4    c2    2
5    c1    1
6    c1    1
7    c2    2
8    c3    3
9    c3    3
df['class'].value_counts()
c1 5
c2 3
c3 2
Name: class, dtype: int64
g = df.groupby('class')
g.apply(lambda x: x.sample(g.size().min()))
        class  val
class
c1    6    c1    1
      5    c1    1
c2    4    c2    2
      1    c2    2
c3    9    c3    3
      8    c3    3
This looks like it works. My main questions:
How does g.apply(lambda x: x.sample(g.size().min())) work? I know what a lambda is, but:
- What is passed to the lambda as x in this case?
- What is g in g.size()?
- Why does the output contain the numbers 6, 5, 4, 1, 9, 8? What do they mean?
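One way to poke at these three questions is to look at the pieces separately. The sketch below rebuilds the frame from Update 2 and inspects the groupby object. The names sizes, pieces and c2_labels are mine, introduced only for illustration, and the comments reflect my current understanding rather than a confirmed explanation.

```python
import pandas as pd

df = pd.DataFrame({
    "class": ["c1", "c2", "c1", "c1", "c2", "c1", "c1", "c2", "c3", "c3"],
    "val": [1, 2, 1, 1, 2, 1, 1, 2, 3, 3],
})

g = df.groupby("class")  # a DataFrameGroupBy object, not a DataFrame

# g.size() is a Series holding the row count of each class.
sizes = g.size()
n_min = sizes.min()

# Iterating over g yields (label, sub-DataFrame) pairs; apply appears to
# hand each of these sub-DataFrames to the lambda as x.
pieces = {label: sub for label, sub in g}

# Each sub-DataFrame keeps the row labels of the original df, which seems
# to be where numbers such as 6, 5, 4, 1 in the sampled output come from.
c2_labels = list(pieces["c2"].index)

print(sizes.to_dict(), n_min, c2_labels)
```

If that reading is right, x.sample(n_min) picks n_min rows from one class at a time, and the inner index level of the result is simply the original position of each picked row in df.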