6

I have a dataframe which represents a population, with each column denoting a different quality/ characteristic of that person. How can I get a sample of that dataframe/ population, which is representative of the population as a whole across all characteristics.

Suppose I have a dataframe which represents a workforce of 650 people as follows:

import pandas as pd
import numpy as np
c = np.random.choice

colours = ['blue', 'yellow', 'green', 'green... no, blue']
knights = ['Bedevere', 'Galahad', 'Arthur', 'Robin', 'Lancelot']
qualities = ['wise', 'brave', 'pure', 'not quite so brave']

df = pd.DataFrame({'name_id':c(range(3000), 650, replace=False),
              'favourite_colour':c(colours, 650),
              'favourite_knight':c(knights, 650),
              'favourite_quality':c(qualities, 650)})

I can get a sample of the above that reflects the distribution of a single column as follows:

# Find the distribution of a particular column using value_counts and normalize:
knight_weight = df['favourite_knight'].value_counts(normalize=True)

# Add this to my dataframe as a weights column:
df['knight_weight'] = df['favourite_knight'].apply(lambda x: knight_weight[x])

# Then sample my dataframe using the weights column I just added as the 'weights' argument:
df_sample = df.sample(140, weights=df['knight_weight'])

This will return a sample dataframe (df_sample) such that:

df_sample['favourite_knight'].value_counts(normalize=True)
is approximately equal to
df['favourite_knight'].value_counts(normalize=True)

My question is this: How can I generate a sample dataframe (df_sample) such that the above i.e.:

df_sample[column].value_counts(normalize=True)
is approximately equal to
df[column].value_counts(normalize=True)

is true for all columns (except 'name_id') instead of just one of them? population of 650 with a sample size of 140 is approximately the sizes I'm working with so performance isn't too much of an issue. I'll happily accept solutions that take a couple of minutes to run as this will still be considerably faster than producing the above sample manually. Thank you for any help.

Linden
  • 531
  • 3
  • 12

2 Answers2

5

You create a combined feature column, weight that one and draw with it as weights:

df["combined"] = list(zip(df["favourite_colour"],
                          df["favourite_knight"],
                          df["favourite_quality"]))

combined_weight = df['combined'].value_counts(normalize=True)

df['combined_weight'] = df['combined'].apply(lambda x: combined_weight[x])

df_sample = df.sample(140, weights=df['combined_weight'])

This will need an additional step of dividing by the count of the specific weight so sum up to 1 - see Ehsan Fathi post.

Patrick Artner
  • 50,409
  • 9
  • 43
  • 69
  • I think you don't need the `normalize=True` – Dani Mesejo Nov 23 '20 at 12:43
  • 1
    @DaniMesejo afaik the normalize will simply do a full sum over all and divide each count by that so you get values (0,1] - as sample can either use those or simply the amounts as is - you are correct. It is not needed as `pd.samle()` provided you give it weights, will _If weights do not sum to 1, they will be normalized to sum to 1_. So normalization is done anyhow and the OP decided to do it directly when gathering the weights so I copied that over ;) – Patrick Artner Nov 23 '20 at 13:05
  • I'd like to know how to do this in R. – Kaleb Coberly Jun 16 '21 at 00:22
  • @Kaleb no R expertise here - maybe ask a new question, provide a dataset in R and link to this question for explanation purposes? – Patrick Artner Jun 16 '21 at 05:11
  • 1
    Thank you so much for this. Using `weights` in pandas sample is the way to draw representative distributions! – Iopheam Mar 26 '22 at 15:02
  • 1
    I believe the code above is missing a point. When you add the weights to the corresponding rows they don't sum up to 1. In this case pandas will normalize it again but it will result in the wrong distribution. You should divide the weight associated with each value by the count of the value again. The normalize argument in the value_counts divides it by the total number of rows – Ehsan Fathi Jul 15 '22 at 19:11
1

I think this will do what you need:

df["combined"] = list(zip(df["favourite_colour"],
                      df["favourite_knight"],
                      df["favourite_quality"]))
weight = df['combined'].value_counts(normalize=True)
counts = df['combined'].value_counts()
df['combined_weight'] = df['routingnumber'].apply(lambda x: 
weight[x]/counts[x])
df_sample = df.sample(140, weights=df['combined_weight'])

Pay attention that normalize=True will divide the total number of that value by the total number of the records. If you use that as the weight for your rows then the weight column won't sum up to 1 and pandas will normalize it again which will result in the wrong distribution.

Ehsan Fathi
  • 598
  • 5
  • 21