Stratified Sampling in Pandas

Question

I've looked at the scikit-learn stratified sampling docs and the pandas docs, as well as the questions "Stratified samples from Pandas" and "sklearn stratified sampling based on a column", but none of them address this issue.

I'm looking for a fast pandas/sklearn/numpy way to generate stratified samples of size n from a dataset. However, for groups with fewer rows than the specified sample size, it should take all of the entries.

Concrete example:

Thank you! :)

I think the title of the question should be changed to reflect that the stratification is of a feature column, not the target column. — wordsforthewise, Dec 02 '20 at 02:58
You could almost use the `imblearn` downsampling or undersampling techniques for this: https://imbalanced-learn.org/stable/under_sampling.html — wordsforthewise, Mar 30 '21 at 03:20

score 106 · Accepted Answer · answered May 22 '17 at 14:20


Use `min` when passing the number of rows to `sample`, so that a group smaller than the requested size is taken whole. Consider the dataframe `df`:

import pandas as pd

df = pd.DataFrame(dict(
    A=[1, 1, 1, 2, 2, 2, 2, 3, 4, 4],
    B=range(10)
))

df.groupby('A', group_keys=False).apply(lambda x: x.sample(min(len(x), 2)))

   A  B
1  1  1
2  1  2
3  2  3
6  2  6
7  3  7
9  4  9
8  4  8
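A variant worth noting, as a sketch rather than part of the original answer: shuffling the whole frame once and then taking `head(n)` per group also yields "up to n rows per group", without calling a Python lambda for every group. That may help when the grouping column has many distinct values. The function name `sample_up_to_n` and the `seed` parameter are made up for illustration.

```python
import pandas as pd


def sample_up_to_n(df, col, n, seed=None):
    """Return at most n randomly chosen rows per group of `col` (hypothetical helper)."""
    # Shuffle all rows once; the first n rows of each group in the
    # shuffled frame are then a uniform random sample of that group.
    shuffled = df.sample(frac=1, random_state=seed)
    # head(n) keeps every row of a group that has fewer than n rows.
    return shuffled.groupby(col, sort=False).head(n)


df = pd.DataFrame(dict(
    A=[1, 1, 1, 2, 2, 2, 2, 3, 4, 4],
    B=range(10)
))

out = sample_up_to_n(df, 'A', 2, seed=0)
print(out.sort_index())
```

The rows come back in shuffled order and keep their original index labels; call `sort_index()` if the original order matters.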


piRSquared


@piRSquared, let's say I have a df with 1M rows, I want to sample 10k of it, with at least 10 samples from each user_id, how would you approach it? – joddm Mar 08 '19 at 12:47
@whitfa still works for me, and the linked change shouldn't impact it at all. What version of pandas are you using? I'm using `0.25` – piRSquared Sep 19 '19 at 15:08
Apologies @piRSquared, looks like I was mistaken! I will delete my original comment. – whitfa Sep 23 '19 at 09:27
When my grouping column has high cardinality this solution is quite slow. Which I guess makes sense. Anyways, can you think of a way to speed it up in scenarios like this? – hipoglucido Jun 16 '21 at 07:43

Ilya Prokin · Answer 2 · 2018-12-05T09:42:16.063

18

Extending the groupby answer, we can make sure that the sample is balanced. When every class has at least n_samples rows, we can simply take n_samples from each class (as in the previous answer). When the smallest class has fewer than n_samples rows, we take that smaller number of rows from every class instead.

def stratified_sample_df(df, col, n_samples):
    n = min(n_samples, df[col].value_counts().min())
    df_ = df.groupby(col).apply(lambda x: x.sample(n))
    df_.index = df_.index.droplevel(0)
    return df_
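As a rough illustration of the balancing rule, here is a sketch that uses `DataFrameGroupBy.sample` (available since pandas 1.1) instead of `groupby(...).apply`. It is not the answer author's code; the name `balanced_sample` and the `seed` parameter are assumptions made for the example.

```python
import pandas as pd


def balanced_sample(df, col, n_samples, seed=None):
    """Draw the same number of rows from every class of `col` (hypothetical helper)."""
    # Cap the per-class size at the size of the smallest class.
    n = int(min(n_samples, df[col].value_counts().min()))
    # Every group has at least n rows, so sampling without replacement is safe.
    return df.groupby(col).sample(n=n, random_state=seed)


df = pd.DataFrame(dict(
    A=[1, 1, 1, 2, 2, 2, 2, 3, 4, 4],
    B=range(10)
))

# Class 3 has a single row, so one row is drawn from each class.
balanced = balanced_sample(df, 'A', n_samples=2, seed=0)
print(balanced)
```

The result keeps the original index labels and all columns, so no `droplevel` step is needed.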

edited Dec 05 '18 at 09:42

answered Dec 04 '18 at 14:58


An explanation, what the posted code does and how this addresses the problem in the question, rarely fails to improve an answer. – MBT Dec 04 '18 at 16:18

irkinosor · Answer 3 · 2019-02-16T10:37:44.870

10

The following samples a total of N rows, where each group appears in its original proportion rounded to the nearest integer, then shuffles the result and resets the index. Using this dataframe:

df = pd.DataFrame(dict(
    A=[1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4],
    B=range(20)
))

Short and sweet:

df.sample(n=N, weights='A', random_state=1).reset_index(drop=True)

Long version

df.groupby('A', group_keys=False).apply(lambda x: x.sample(int(np.rint(N*len(x)/len(df))))).sample(frac=1).reset_index(drop=True)
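The long version can also be unrolled into an explicit loop, which makes the per-group row counts easier to inspect. This is a sketch under assumptions: the function name `proportional_sample` and the `seed` parameter are hypothetical, and the rounding follows the `np.rint` rule used above.

```python
import numpy as np
import pandas as pd


def proportional_sample(df, col, total, seed=None):
    """Sample about `total` rows, keeping each group's share (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    parts = []
    for _, group in df.groupby(col):
        # Rows for this group: its share of the frame, rounded to an integer.
        k = int(np.rint(total * len(group) / len(df)))
        parts.append(group.sample(n=k, random_state=rng))
    # Shuffle the combined rows and renumber the index.
    return pd.concat(parts).sample(frac=1, random_state=rng).reset_index(drop=True)


df = pd.DataFrame(dict(
    A=[1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4],
    B=range(20)
))

sampled = proportional_sample(df, 'A', total=10, seed=1)
print(sampled['A'].value_counts().sort_index())
```

Because each group's count is rounded separately, the counts do not always add up to exactly `total`; here they happen to (4 + 3 + 1 + 2 = 10). This approach works for non-numeric grouping columns as well.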

edited Feb 16 '19 at 10:37

answered Feb 16 '19 at 10:26


There is an issue with the short version: it does not keep the original proportions. It doesn't really make sense to pass the category column as the weights parameter, e.g. it could be a string. If you really want to use df.sample, you need to compute an additional column equal to the frequency of the category column. But the long version works! – steco Jul 12 '19 at 09:43
Short version doesn't work for me for binary data, e.g. `df = pd.DataFrame({'A': [np.random.randint(0, 2) for _ in range(100)]})` – npit Nov 30 '21 at 16:20
Will not work if the column `A` is not numeric. – hafiz031 Dec 21 '21 at 06:41

score 2 · Answer 4 · answered Apr 21 '22 at 03:42

So I tried all the methods above and they are still not quite what I wanted (will explain why).

Step 1: Yes, we need to `groupby` the target variable, let's call it `target_variable`. So the first part of the code will look like this:

df.groupby('target_variable', group_keys=False)

I am setting group_keys=False as I am not trying to inherit indexes into the output.

Step 2: use `apply` to sample from various classes within the `target_variable`.

This is where I found the above answers not quite universal. In my example, this is what I have as label numbers in the df:

array(['S1','S2','normal'], dtype=object),
array([799, 2498,3716391])

So you can see how imbalanced my target_variable is. What I need to do is make sure I am taking the number of S1 labels as the minimum number of samples for each class.

min(np.unique(df['target_variable'], return_counts=True)[1])

This is what @piRSquared's answer is lacking. Then, for each class, you choose between the smallest class count (799 here) and the size of that class. This is not a general rule and you can take other numbers. For example:

max(len(x), min(np.unique(data_use['snd_class'], return_counts=True)[1]))

which will give you the max of your smallest class compared to the number of each and every class.

The other technical issue with their answer is that you are advised to shuffle your output once you have sampled. You do not want all S1 samples in consecutive rows, then S2, and so forth; you want the rows in random order. That is where sample(frac=1) comes in. The value 1 means all the data is returned after shuffling. If you need less for any reason, provide a fraction such as 0.6, which returns a shuffled 60% of the sampled rows.

Step 3: Final line looks like this for me:

df.groupby('target_variable', group_keys=False).apply(lambda x: x.sample(min(len(x), min(np.unique(df['target_variable'], return_counts=True)[1])))).sample(frac=1)

I am selecting index 1 in np.unique(df['target_variable'], return_counts=True)[1] because that element holds the count of each class as a numpy array. Feel free to modify as appropriate.
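To make the indexing concrete, here is a small sketch with made-up labels (the array below is illustrative, not the data from this answer). With `return_counts=True`, `np.unique` returns a tuple of the sorted unique values and their counts, so element 1 is the counts array.

```python
import numpy as np

# Toy labels, invented for illustration only.
labels = np.array(['S1', 'S2', 'normal', 'normal', 'S2', 'normal'])

# np.unique returns (sorted unique values, count of each value).
values, counts = np.unique(labels, return_counts=True)

# Size of the smallest class, i.e. what min(...[1]) computes above.
smallest = counts.min()

print(values, counts, smallest)
```

Computing `smallest` once, outside the `apply` lambda, also avoids re-scanning the whole column for every group.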

score 2 · Answer 5 · answered Nov 18 '22 at 17:23

Based on user piRSquared's response, we might have:

import pandas as pd


def stratified_sample(df: pd.DataFrame, groupby_column: str, sampling_rate: float = 0.01) -> pd.DataFrame:
    assert 0.0 < sampling_rate <= 1.0
    assert groupby_column in df.columns

    num_rows = int((df.shape[0] * sampling_rate) // 1)
    num_classes = len(df[groupby_column].unique())
    num_rows_per_class = int(max(1, ((num_rows / num_classes) // 1)))
    df_sample = df.groupby(groupby_column, group_keys=False).apply(lambda x: x.sample(min(len(x), num_rows_per_class)))

    return df_sample
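The per-class quota used by this function can be checked in isolation. The helper below is a hypothetical extraction of that arithmetic, written only to show how the numbers come out; it is not part of the answer's code.

```python
def rows_per_class(total_rows, num_classes, sampling_rate=0.01):
    """Per-class quota: the requested rows split evenly, at least 1 (hypothetical helper)."""
    # Total number of rows requested, rounded down.
    num_rows = int(total_rows * sampling_rate)
    # Even split across classes, never below one row per class.
    return max(1, num_rows // num_classes)


# 1000 rows at a 25% rate over 4 classes: 250 rows requested, 62 per class.
quota_large = rows_per_class(1000, 4, sampling_rate=0.25)

# 50 rows at a 1% rate over 10 classes: the split rounds down to 0, so the floor of 1 applies.
quota_small = rows_per_class(50, 10, sampling_rate=0.01)

print(quota_large, quota_small)
```

Note that the final sample can be smaller than `sampling_rate` suggests, because of the floor division and because classes with fewer rows than the quota contribute only what they have. With many small classes it can also be larger, since every class contributes at least one row.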
