How to do a random stratified sampling with Python (Not a train/test split)?

Question

I am looking for the best way to do random stratified sampling, as used in surveys and polls. I don't want to use sklearn.model_selection.StratifiedShuffleSplit, since I am not doing supervised learning and have no target. I just want to create random stratified samples from a pandas DataFrame (https://www.investopedia.com/terms/stratified_random_sampling.asp).

Python is my main language.

Thank you for any help

The " I would like the sample to be as representative of my population as it can" part of your question seems to make it a really difficult problem to address, thus too broad for stackoverflow... — P. Camilleri, May 06 '18 at 00:25
look into [`pandas.DataFrame.sample()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sample.html) — sacuL, May 06 '18 at 00:26
You probably need to figure out the statistical part of this question first (describe much more precisely what sampling procedure you need to implement) and also describe what data you have (do you have full population data? weighted survey data?), before this will be in scope for this site. — Stuart, May 06 '18 at 00:37
Thanks sacul. I found it pretty handy and used it in my solution. — asl, May 06 '18 at 21:21

score 10 · Answer 1 · answered Sep 05 '19 at 06:17

Given that the variables are binned, the following one-liner should give you the desired output. scikit-learn is mainly used for purposes other than yours, but using a function from it should do no harm.

Note that if you have a scikit-learn version earlier than 0.19.0, the sampling result might contain duplicate rows.

If you test the following method, please share whether it behaves as expected or not.

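from sklearn.model_selection import train_test_split

# test_size=0.999 discards 99.9% of the rows, so the "train" part is a 0.1% sample
# stratified on the combination of income, sex and age
stratified_sample, _ = train_test_split(population, test_size=0.999, stratify=population[['income', 'sex', 'age']])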

Furkan Gursoy


I tested this myself. Unfortunately, this does not have the desired behavior. Since it is designed for keeping the relative frequencies the same ACROSS a train and test set, it is not focused on making the frequencies equal WITHIN a single set. You can read more [here](https://scikit-learn.org/stable/modules/cross_validation.html#stratification) – Renel Chesak Jun 23 '21 at 12:54
I found a working solution [here](https://stackoverflow.com/a/44115314/7560187). – Renel Chesak Jun 23 '21 at 13:29
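
The distinction raised in the comments, proportional allocation versus equal allocation within one sample, can be sketched with pandas alone. This is only an illustration under assumptions: the toy population, the column names and the sample sizes are hypothetical, and `DataFrameGroupBy.sample` needs pandas 1.1 or later. It is not the code from the linked answer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy population with one imbalanced stratum column (about 80% F, 20% M).
rng = np.random.default_rng(0)
population = pd.DataFrame({
    "sex": rng.choice(["F", "M"], size=10_000, p=[0.8, 0.2]),
    "value": rng.normal(size=10_000),
})

# Proportional allocation: the same fraction from every stratum,
# so the sample keeps the imbalance of the population.
proportional = population.groupby("sex").sample(frac=0.01, random_state=0)

# Equal allocation: the same number of rows from every stratum,
# so every stratum is equally frequent within the sample.
equal = population.groupby("sex").sample(n=50, random_state=0)

print(proportional["sex"].value_counts())
print(equal["sex"].value_counts())
```

Which of the two is wanted depends on the survey design: proportional allocation mirrors the population, while equal allocation gives small strata the same weight as large ones.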

asl · Answer 2 · 2018-05-06T19:47:14.533

This is my best solution so far. It is important to bin continuous variables beforehand and to have a minimum number of observations in each stratum.

In this example, I am:

Generating a population
Sampling in a pure random way
Sampling in a random stratified way

When comparing both samples, the stratified one is much more representative of the overall population.

If anyone knows a better way to do it, please feel free to share.

import pandas as pd
import numpy as np

# Generate random population (100K)

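population = pd.DataFrame(index=range(0,100000))
population['income'] = 0
# Assign through .loc on the DataFrame itself: the chained form
# population['income'].iloc[...] = ... may write to a temporary copy.
# .loc slices by label and includes the end label.
population.loc[39000:79999, 'income'] = 1
population.loc[80000:, 'income'] = 2
population['sex'] = np.random.randint(0,2,100000)
population['age'] = np.random.randint(0,4,100000)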

pop_count = population.groupby(['income', 'sex', 'age'])['income'].count()

# Random sampling (100 observations out of 100k)

# DataFrame.sample draws without replacement by default, so no row is picked twice
# (np.random.randint positions could repeat)
random_sample = population.sample(n=len(population) // 1000)

# Random Stratified Sampling (0.1% of each stratum, roughly 100 observations out of 100k)

stratified_sample = list(map(lambda x : population[
    (
        population['income'] == pop_count.index[x][0]
    ) 
    &
    (
        population['sex'] == pop_count.index[x][1]
    )
    &
    (
        population['age'] == pop_count.index[x][2]
    )
].sample(frac=0.001), range(len(pop_count))))

stratified_sample = pd.concat(stratified_sample)
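
The claim that the stratified sample is more representative can be checked by comparing the share of each stratum in a sample with its share in the population. The sketch below is one possible way to do that, not the author's own comparison. It rebuilds a population with the same layout, draws the stratified sample with `groupby(...).sample(frac=...)` (a swapped-in shortcut that assumes pandas 1.1 or later), and reports the largest absolute gap in stratum shares. The helper names `stratum_shares` and `max_share_error` and the seed are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 100_000
strata = ["income", "sex", "age"]

# Same layout as the example above: three income bands, random sex and age bins.
population = pd.DataFrame({
    "income": np.repeat([0, 1, 2], [39_000, 41_000, 20_000]),
    "sex": rng.integers(0, 2, n),
    "age": rng.integers(0, 4, n),
})

# Simple random sample without replacement (0.1% of the population).
random_sample = population.sample(frac=0.001, random_state=42)

# Stratified sample: 0.1% of each stratum, rounded per stratum.
stratified_sample = population.groupby(strata).sample(frac=0.001, random_state=42)


def stratum_shares(df):
    """Share of rows that fall in each (income, sex, age) stratum."""
    return df.groupby(strata).size() / len(df)


pop_shares = stratum_shares(population)


def max_share_error(sample):
    """Largest absolute gap between sample and population stratum shares."""
    sample_shares = stratum_shares(sample).reindex(pop_shares.index, fill_value=0)
    return (sample_shares - pop_shares).abs().max()


print("random:    ", max_share_error(random_sample))
print("stratified:", max_share_error(stratified_sample))
```

Because the fraction is rounded within each stratum, the stratified sample may contain slightly more or fewer than 100 rows.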

score 0 · Answer 3 · edited May 19 '23 at 21:19

You could do this without scikit-learn using a function similar to this:

import pandas as pd

def stratified_sampling(df, strata_col, sample_size):
    # Sample the same fraction of rows from each stratum.
    # DataFrame.append was removed in pandas 2.0, so the pieces are
    # collected in a list and combined with pd.concat.
    stratum_samples = [
        group.sample(frac=sample_size, replace=False, random_state=7)
        for _, group in df.groupby(strata_col)
    ]
    return pd.concat(stratum_samples)

In the above:

df is the DataFrame to be sampled
strata_col is the column representing the strata of interest (e.g. 'gender')
sample_size is the fraction of each stratum to sample (e.g. 0.2 for 20% of the data)

You could then call stratified_sampling as follows:

sample = stratified_sampling(df_to_be_sampled, 'gender', 0.2)

This will return a new DataFrame called sample containing the randomly sampled data. Note I've chosen random_state=7 for testing and reproducibility but this is of course arbitrary.
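
A minimal sketch of that call on toy data. The function is restated so the block runs on its own with current pandas. The `random_state` parameter and the 60/40 toy DataFrame are assumptions added for illustration and are not part of the answer above.

```python
import pandas as pd


def stratified_sampling(df, strata_col, sample_size, random_state=7):
    """Draw the fraction `sample_size` from each stratum of `strata_col`."""
    pieces = [
        group.sample(frac=sample_size, replace=False, random_state=random_state)
        for _, group in df.groupby(strata_col)
    ]
    return pd.concat(pieces)


# Hypothetical data: 60 rows for one gender, 40 for the other.
df_to_be_sampled = pd.DataFrame({
    "gender": ["F"] * 60 + ["M"] * 40,
    "score": range(100),
})

sample = stratified_sampling(df_to_be_sampled, "gender", 0.2)

# Each stratum contributes 20% of its rows, so the 60/40 split is preserved.
print(sample["gender"].value_counts())
```

Exposing the seed as a parameter keeps runs reproducible while still allowing a different draw when needed.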
