13

How can I sample a pandas DataFrame or GraphLab SFrame according to a given class-label distribution? For example, I want to sample a data frame that has a label/class column so that each class label is selected equally often, giving a uniform distribution of class labels. Better still would be to draw samples according to whatever class distribution we specify.

+------+-------+-------+
| col1 | clol2 | class |
+------+-------+-------+
| 4    | 45    | A     |
+------+-------+-------+
| 5    | 66    | B     |
+------+-------+-------+
| 5    | 6     | C     |
+------+-------+-------+
| 4    | 6     | C     |
+------+-------+-------+
| 321  | 1     | A     |
+------+-------+-------+
| 32   | 432   | B     |
+------+-------+-------+
| 5    | 3     | B     |
+------+-------+-------+

Given a huge dataframe like the one above and a required frequency distribution like the one below:
+-------+--------------+
| class | nostoextract |
+-------+--------------+
| A     | 2            |
+-------+--------------+
| B     | 2            |
+-------+--------------+
| C     | 2            |
+-------+--------------+


Rows should be extracted from the first dataframe according to the frequency distribution in the second, where the counts are given in the nostoextract column, producing a sampled frame in which each class appears at most 2 times. If a class does not have enough rows to meet the required count, the sampler should take what is available and continue. The resulting dataframe is to be used for a decision-tree-based classifier.

As a commenter puts it: the sampled dataframe has to contain nostoextract different instances of the corresponding class, unless there are not enough examples for a given class, in which case you just take all the available ones.

papayawarrior
stackit
  • Could you add some examples of what you want to achieve? And did you look at `pandas.DataFrame.sample`? (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sample.html) – chris-sc Oct 13 '15 at 08:08
  • @chris-sc yes it does not allow to sample based on class column – stackit Oct 13 '15 at 08:23
  • basically I want to sample a skewed data frame such that all the class labels are sufficiently represented as much as possible. The class labels are in the "label" column. This is fed to a classifier. @chris-sc – stackit Oct 13 '15 at 08:25
  • I think you want [`StratifiedKFold`](http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.StratifiedKFold.html#sklearn.cross_validation.StratifiedKFold) this returns iterators that preserve a uniform split of your data for each class label – EdChum Oct 13 '15 at 08:40
  • @EdChum No it does not give an option to specify the class distribution, and does not do what is asked. It just samples preserving existing distribution to the new samples. – stackit Oct 13 '15 at 10:35
  • Sorry can you post example code and desired output as I don't quite get what you want – EdChum Oct 13 '15 at 11:29
  • @EdChum posted an example , do let me know any more doubts – stackit Oct 13 '15 at 12:02
  • So are you wanting just the same number of samples for each label? For instance in your example although you have 3 'Bs' you end up with 2 of each class? – EdChum Oct 13 '15 at 12:14
  • So you basically want to bootstrap, where each bootstrap sample has to contain `nostoextract` different instances of the corresponding class? Unless there are not enough examples for a given class in which case you just take all the available ones? – swenzel Oct 13 '15 at 12:32
  • @swenzel yes you are right – stackit Oct 13 '15 at 12:49

4 Answers

5

Can you split your first dataframe into class-specific sub-dataframes, and then sample at will from those?

i.e.

dfa = df[df['class']=='A']
dfb = df[df['class']=='B']
dfc = df[df['class']=='C']
....

Then, once you've created dfa, dfb and dfc, take as many rows as you need from the top (if the dataframes aren't sorted in any particular way):

dfasamplefive = dfa[:5]

Or use the sample method as described by a previous commenter to directly take a random sample:

dfasamplefive = dfa.sample(n=5)

If that suits your needs, all that's left to do is automate the process, feeding in the number to be sampled from the control dataframe you have as your second dataframe containing the desired number of samples.
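
The automation step described above could look roughly like the sketch below. The helper name `sample_by_counts`, the `control` frame, the `seed` parameter and the choice to skip classes missing from the control frame are my own assumptions for illustration, not something from the question; the column names follow the question's example.

```python
import pandas as pd

df = pd.DataFrame({'col1': [4, 5, 5, 4, 321, 32, 5],
                   'col2': [45, 66, 6, 6, 1, 432, 3],
                   'class': ['A', 'B', 'C', 'C', 'A', 'B', 'B']})

# Control frame: how many rows to draw for each class.
control = pd.DataFrame({'class': ['A', 'B', 'C'],
                        'nostoextract': [2, 2, 2]})


def sample_by_counts(df, control, label='class', count='nostoextract', seed=None):
    """Draw up to the requested number of rows per class, without replacement."""
    wanted = dict(zip(control[label], control[count]))
    parts = []
    for cls, group in df.groupby(label):
        # Assumption: a class that is absent from the control frame is skipped.
        # If a class has too few rows, take all of them.
        n = min(len(group), int(wanted.get(cls, 0)))
        parts.append(group.sample(n=n, random_state=seed))
    return pd.concat(parts)


sampled = sample_by_counts(df, control, seed=0)
print(sampled['class'].value_counts().sort_index())
```

Iterating over the groups directly, rather than writing one filter per class by hand, keeps the code the same size however many classes there are.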

Thomas Kimber
4

I think this will solve your problem:

import pandas as pd

data = pd.DataFrame({'cols1':[4, 5, 5, 4, 321, 32, 5],
                     'clol2':[45, 66, 6, 6, 1, 432, 3],
                     'class':['A', 'B', 'C', 'C', 'A', 'B', 'B']})

freq = pd.DataFrame({'class':['A', 'B', 'C'],
                     'nostoextract':[2, 2, 2], })

def bootstrap(data, freq):
    freq = freq.set_index('class')

    # This function will be applied on each group of instances of the same
    # class in `data`.
    def sampleClass(classgroup):
        cls = classgroup['class'].iloc[0]
        nDesired = freq.nostoextract[cls]
        nRows = len(classgroup)

        nSamples = min(nRows, nDesired)
        return classgroup.sample(nSamples)

    samples = data.groupby('class').apply(sampleClass)

    # If you want a new index with ascending values
    # samples.index = range(len(samples))

    # If you want an index which is equal to the row in `data` where the sample
    # came from
    samples.index = samples.index.get_level_values(1)

    # If you don't change it then you'll have a multiindex with level 0
    # being the class and level 1 being the row in `data` where
    # the sample came from.

    return samples

print(bootstrap(data,freq))

Prints:

  class  clol2  cols1
0     A     45      4
4     A      1    321
1     B     66      5
5     B    432     32
3     C      6      4
2     C      6      5

If you don't want the result to be ordered by class, you can shuffle it at the end, for example with samples.sample(frac=1).

Community
swenzel
1

Here's a solution for SFrames. It's not exactly what you want, because it samples points randomly, so that the results don't necessarily have precisely the number of rows you specify. An exact method would probably shuffle the data randomly then take the first k rows for a given class, but this gets you pretty darn close.

import random
import graphlab as gl

## Construct data.
sf = gl.SFrame({'col1': [4, 5, 5, 4, 321, 32, 5],
                'col2': [45, 66, 6, 6, 1, 432, 3],
                'class': ['A', 'B', 'C', 'C', 'A', 'B', 'B']})

freq = gl.SFrame({'class': ['A', 'B', 'C'],
                  'number': [3, 1, 0]})

## Count how many instances of each class and compute a sampling
#  probability.
grp = sf.groupby('class', gl.aggregate.COUNT)
freq = freq.join(grp, on='class', how='left')
freq['prob'] = freq.apply(lambda x: float(x['number']) / x['Count'])

## Join the sampling probability back to the original data.
sf = sf.join(freq[['class', 'prob']], on='class', how='left')

## Sample the original data, then subset.
sf['sample_mask'] = sf.apply(lambda x: 1 if random.random() <= x['prob'] 
                             else 0)
sf2 = sf[sf['sample_mask'] == 1]

In my sample run, I happened to get the exact number of samples I specified, but again, this is not guaranteed with this solution.

>>> sf2
+-------+------+------+
| class | col1 | col2 |
+-------+------+------+
|   A   |  4   |  45  |
|   A   | 321  |  1   |
|   B   |  32  | 432  |
+-------+------+------+
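
The exact method mentioned at the top of this answer (shuffle, then take the first k rows per class) is sketched below in pandas rather than SFrame, since I'm only confident of the pandas calls. The `wanted` dict mirrors the `freq` frame above; the variable names and the fixed `random_state` are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col1': [4, 5, 5, 4, 321, 32, 5],
                   'col2': [45, 66, 6, 6, 1, 432, 3],
                   'class': ['A', 'B', 'C', 'C', 'A', 'B', 'B']})

# Requested rows per class (same numbers as the `freq` frame above).
wanted = {'A': 3, 'B': 1, 'C': 0}

# Shuffle all rows, then number the rows within each class: 0, 1, 2, ...
shuffled = df.sample(frac=1, random_state=0)
rank = shuffled.groupby('class').cumcount()

# Keep a row only if its within-class rank is below the requested count.
# Classes missing from `wanted` get a limit of 0 (an assumption).
limit = shuffled['class'].map(wanted).fillna(0)
exact = shuffled[rank < limit]

print(exact.sort_values('class'))
```

Unlike the probabilistic mask, this never returns more rows than requested for a class, and it returns every available row when a class is too small (here only two A rows exist, so two are returned).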
papayawarrior
0

I think I've got a clean solution.

Let's set up df:

import pandas as pd

df = pd.DataFrame({'cols1':[4, 5, 5, 4, 321, 32, 5],
                   'clol2':[45, 66, 6, 6, 1, 432, 3],
                   'class':['A', 'B', 'C', 'C', 'A', 'B', 'B']})

One-liner:

An even distribution can be achieved with a one-liner (each class must have at least 2 rows, because sample draws without replacement by default):

df.groupby('class', group_keys=False).apply(lambda x: x.sample(2))


With specified distribution:

If you want to specify the distribution, you need to modify it a little bit:

freq = {'A':1,'B':2,'C':3}

def get_sample(df,freq):
    sample_size = freq[df['class'].iloc[0]]
    return df.sample(sample_size, replace=True)

df.groupby('class', group_keys=False).apply(lambda x: get_sample(x,freq))

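replace=True samples with replacement, which lets you oversample a class: here C has only 2 rows but 3 are requested, so at least one row is repeated. Note that it also allows repeated rows in classes that do have enough rows.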
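
If you would rather avoid duplicates wherever a class has enough rows, one option is to switch to replacement only when the requested count exceeds the class size. This is a sketch under that assumption; the helper name `sample_with_fallback` and the `seed` parameter are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({'cols1': [4, 5, 5, 4, 321, 32, 5],
                   'clol2': [45, 66, 6, 6, 1, 432, 3],
                   'class': ['A', 'B', 'C', 'C', 'A', 'B', 'B']})

freq = {'A': 1, 'B': 2, 'C': 3}


def sample_with_fallback(df, freq, label='class', seed=None):
    """Sample freq[cls] rows per class, oversampling only when necessary."""
    parts = []
    for cls, group in df.groupby(label):
        n = freq.get(cls, 0)  # assumption: unlisted classes are dropped
        # Draw with replacement only if the class has fewer than n rows.
        parts.append(group.sample(n=n, replace=n > len(group), random_state=seed))
    return pd.concat(parts)


out = sample_with_fallback(df, freq, seed=0)
print(out)
```

With the frequencies above, the two B rows in the result are always distinct, while C (2 rows available, 3 requested) necessarily contains a repeat.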

Yurkee