15

I am looking for the best way to do random stratified sampling, as used in surveys and polls. I don't want to use sklearn.model_selection.StratifiedShuffleSplit because I am not doing supervised learning and have no target. I just want to create random stratified samples from a pandas DataFrame (https://www.investopedia.com/terms/stratified_random_sampling.asp).

Python is my main language.

Thank you for any help

asl
  • The "I would like the sample to be as representative of my population as it can" part of your question seems to make it a really difficult problem to address, thus too broad for stackoverflow... – P. Camilleri May 06 '18 at 00:25
  • look into [`pandas.DataFrame.sample()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sample.html) – sacuL May 06 '18 at 00:26
  • You probably need to figure out the statistical part of this question first (describe much more precisely what sampling procedure you need to implement) and also describe what data you have (do you have full population data? weighted survey data?), before this will be in scope for this site. – Stuart May 06 '18 at 00:37
  • Thanks sacul. I found it pretty handy and used it in my solution. – asl May 06 '18 at 21:21

3 Answers

10

Given that the variables are binned, the following one-liner should give you the desired output. scikit-learn is mainly used for purposes other than yours, but borrowing a function from it should do no harm.

Note that if you have a scikit-learn version earlier than 0.19.0, the sampling result might contain duplicate rows.

If you test the following method, please share whether it behaves as expected or not.

from sklearn.model_selection import train_test_split

# test_size=0.999 keeps 0.1% of the rows; stratify on the three binned columns.
stratified_sample, _ = train_test_split(population, test_size=0.999, stratify=population[['income', 'sex', 'age']])
Furkan Gursoy
  • I tested this myself. Unfortunately, this does not have the desired behavior. Since it is designed for keeping the relative frequencies the same ACROSS a train and test set, it is not focused on making the frequencies equal WITHIN a single set. You can read more [here](https://scikit-learn.org/stable/modules/cross_validation.html#stratification) – Renel Chesak Jun 23 '21 at 12:54
  • I found a working solution [here](https://stackoverflow.com/a/44115314/7560187). – Renel Chesak Jun 23 '21 at 13:29
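
The distinction raised in the comments above, keeping each stratum's population share versus drawing the same number of rows from every stratum, can be sketched with pandas alone. This is a minimal illustration, not the linked solution: the toy `population` frame, the column names, and the sizes are assumptions, and `DataFrameGroupBy.sample` requires pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy population with an imbalanced stratum column.
rng = np.random.default_rng(0)
population = pd.DataFrame({
    "income": rng.choice([0, 1, 2], size=10_000, p=[0.6, 0.3, 0.1]),
    "sex": rng.integers(0, 2, size=10_000),
})

# Proportionate allocation: every stratum contributes the same fraction,
# so the sample keeps (approximately) the population shares.
proportionate = population.groupby("income").sample(frac=0.01, random_state=0)

# Equal allocation: every stratum contributes the same number of rows,
# so the strata are balanced inside the sample.
equal = population.groupby("income").sample(n=30, random_state=0)

print(proportionate["income"].value_counts().sort_index())
print(equal["income"].value_counts().sort_index())
```

Which allocation is appropriate depends on the goal: proportionate allocation mirrors the population, while equal allocation over-represents small strata on purpose.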
4

This is my best solution so far. It is important to bin continuous variables beforehand and to have a minimum number of observations in each stratum.

In this example, I am:

  1. Generating a population
  2. Sampling in a pure random way
  3. Sampling in a random stratified way

When comparing both samples, the stratified one is much more representative of the overall population.

If anyone has an idea of a more optimal way to do it, please feel free to share.


import pandas as pd
import numpy as np

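# Generate random population (100K)

population = pd.DataFrame(index=range(0,100000))
population['income'] = 0
# Assign through .loc on the DataFrame itself; chained indexing such as
# population['income'].iloc[...] = 1 may silently modify a copy instead.
population.loc[39000:79999, 'income'] = 1
population.loc[80000:, 'income'] = 2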
population['sex'] = np.random.randint(0,2,100000)
population['age'] = np.random.randint(0,4,100000)

pop_count = population.groupby(['income', 'sex', 'age'])['income'].count()

# Random sampling (100 observations out of 100k)
# DataFrame.sample draws without replacement by default, so no row is repeated.

random_sample = population.sample(n=len(population) // 1000)

# Random Stratified Sampling (100 observations out of 100k)

stratified_sample = list(map(lambda x : population[
    (
        population['income'] == pop_count.index[x][0]
    ) 
    &
    (
        population['sex'] == pop_count.index[x][1]
    )
    &
    (
        population['age'] == pop_count.index[x][2]
    )
].sample(frac=0.001), range(len(pop_count))))

stratified_sample = pd.concat(stratified_sample)
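
A more compact way to express the same idea, and to check the "more representative" claim numerically, might look like the sketch below. It assumes pandas 1.1 or later for `DataFrameGroupBy.sample`; the seed, the `stratum_shares` helper, and the gap metric are illustrative choices rather than part of the original solution.

```python
import numpy as np
import pandas as pd

# Rebuild a population with the same layout as above (seeded for repeatability).
rng = np.random.default_rng(42)
n = 100_000
population = pd.DataFrame({
    "income": np.repeat([0, 1, 2], [39_000, 41_000, 20_000]),
    "sex": rng.integers(0, 2, size=n),
    "age": rng.integers(0, 4, size=n),
})
strata = ["income", "sex", "age"]

# Stratified: draw 0.1% from every (income, sex, age) combination.
stratified_sample = population.groupby(strata).sample(frac=0.001, random_state=42)

# Simple random sample of the same size, without replacement.
random_sample = population.sample(n=len(stratified_sample), random_state=42)

def stratum_shares(df):
    """Share of rows falling in each stratum (hypothetical helper)."""
    return df.groupby(strata).size() / len(df)

# Strata missing from a sample show up as NaN after alignment; treat as 0.
comparison = pd.DataFrame({
    "population": stratum_shares(population),
    "stratified": stratum_shares(stratified_sample),
    "random": stratum_shares(random_sample),
}).fillna(0.0)

# Largest absolute deviation from the population share, per sampling method.
max_gap_stratified = (comparison["stratified"] - comparison["population"]).abs().max()
max_gap_random = (comparison["random"] - comparison["population"]).abs().max()

print(comparison.round(3))
print("max gap (stratified):", max_gap_stratified)
print("max gap (random):", max_gap_random)
```

Because each stratum's sample count is rounded to a whole number, the stratified sample size is only approximately 100 here.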
asl
0

You could do this without scikit-learn using a function similar to this:

import pandas as pd
import numpy as np

def stratified_sampling(df, strata_col, sample_size):
    # Sample the same fraction from every stratum, then combine the pieces.
    # DataFrame.append was removed in pandas 2.0, so collect and concatenate.
    samples = [
        group.sample(frac=sample_size, replace=False, random_state=7)
        for _, group in df.groupby(strata_col)
    ]
    return pd.concat(samples)

In the above:

  • df is the DataFrame to be sampled
  • strata_col is the column representing the strata of interest (e.g. 'gender')
  • sample_size is the fraction of each stratum to sample (e.g. 0.2 for 20% of the data)

You could then call stratified_sampling as follows:

sample = stratified_sampling(df_to_be_sampled, 'gender', 0.2)

This will return a new DataFrame called sample containing the randomly sampled data. Note I've chosen random_state=7 for testing and reproducibility but this is of course arbitrary.
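
If a fixed total number of rows is wanted instead of a fraction, one possible variant allocates the total across strata in proportion to their size. The sketch below is an assumption-laden illustration: the name `stratified_sample_n`, its parameters, and the largest-remainder rounding rule are hypothetical choices, not part of the function above.

```python
import numpy as np
import pandas as pd

def stratified_sample_n(df, strata_cols, n, seed=None):
    """Hypothetical helper: draw exactly n rows, allocated to strata
    in proportion to stratum size (largest-remainder rounding)."""
    groups = [group for _, group in df.groupby(strata_cols)]
    sizes = np.array([len(g) for g in groups])
    total = int(sizes.sum())
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the number of grouped rows")

    # Ideal (fractional) quota per stratum, rounded down to whole rows.
    quotas = sizes * n / total
    alloc = np.floor(quotas).astype(int)

    # Hand the rows lost to rounding to the strata with the largest remainders.
    leftover = int(n - alloc.sum())
    order = np.argsort(quotas - alloc)[::-1]
    alloc[order[:leftover]] += 1

    # One shared generator so the strata are not sampled with identical seeds.
    rng = np.random.default_rng(seed)
    parts = [g.sample(n=int(k), random_state=rng) for g, k in zip(groups, alloc)]
    return pd.concat(parts)

# Usage on a toy frame: 70% 'F' and 30% 'M', asking for 10 rows in total.
df = pd.DataFrame({"gender": ["F"] * 70 + ["M"] * 30, "score": range(100)})
sample = stratified_sample_n(df, "gender", n=10, seed=7)
print(sample["gender"].value_counts())
```

Passing a NumPy `Generator` as `random_state` assumes pandas 1.1 or later.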

tdy