Bootstrap sampling in Pandas with weighting on multiple levels

Given a table like the one in the example below (possibly with additional columns), I want to draw bootstrap samples in which countries and fruits are each sampled independently and uniformly at random.

Each country has a number of fruits, and that number varies between countries.

To make it clear what I am looking for, I have created a series (1-4) of sampling strategies, starting simple and moving progressively closer to what I want:

Sample M fruits per country ...

  1. ... uniformly.
  2. ... inversely proportional to the number of occurrences of the fruit.
  3. ... (on average) uniformly, but bootstrap the countries.
  4. ... (on average) inversely proportional to the number of occurrences of the fruit, but bootstrap the countries.

As a minimal example for my question, I have chosen countries and fruits.

| Country |   Fruit    |
| ------- | ---------- |
| USA     | Pineapple  |
| USA     | Apple      |
| Canada  | Watermelon |
| Canada  | Banana     |
| Canada  | Apple      |
| Mexico  | Cherry     |
| Mexico  | Apple      |
| Mexico  | Apple      |
| ...     | ...        |

Create example data:

import pandas as pd
import numpy as np

df = pd.DataFrame(
    np.array([
        ['USA', 'USA', 'Canada', 'Canada', 'Canada', 'Canada', 'Canada', 'Mexico', 'Mexico', 'Mexico', 'Mexico', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'UK', 'UK', 'France', 'France', 'Germany', 'Germany', 'Germany', 'Germany', 'Germany', 'Italy', 'Italy', 'Spain', 'Spain', 'Spain', 'Spain', 'Spain'],
        ['Pineapple', 'Apple', 'Pineapple', 'Apple', 'Cherry', 'Watermelon', 'Orange', 'Apple', 'Banana', 'Cherry', 'Orange', 'Watermelon', 'Banana', 'Apple', 'Blueberry', 'Cherry', 'Apple', 'Banana', 'Blueberry', 'Banana', 'Apple', 'Cherry', 'Blueberry', 'Pineapple', 'Pineapple', 'Watermelon', 'Pineapple', 'Watermelon', 'Apple', 'Orange', 'Blueberry'],
    ]).T,
    columns=['Country', 'Fruit'],
).set_index('Country')
df['other columns'] = '...'

Setup:

M = 10  # number of fruits to sample per country
rng = np.random.default_rng(seed=123)  # set seed for reproducibility

# create weights for later use
fruit_weights = 1 / df.groupby('Fruit').size().rename('fruit_weights')
country_weights = 1 / df.groupby('Country').size().rename('country_weights')

# normalize weights to sum to 1
fruit_weights /= fruit_weights.sum()
country_weights /= country_weights.sum()

(1) Sample M fruits per country uniformly:

sampled_fruits = df.groupby('Country').sample(n=M, replace=True, random_state=rng)

(2) Sample M fruits per country inversely proportional to the number of occurrences of the fruit:

df2 = df.join(fruit_weights, on='Fruit')  # add weights to a copy of the original dataframe
sampled_fruits = df2.groupby('Country').sample(
    n=M,
    replace=True,
    random_state=rng,
    weights='fruit_weights',
)

(3) Sample M fruits per country (on average) uniformly, but bootstrap the countries:

sampled_fruits = pd.concat(
    {
        s: df.sample(
            n=df.index.nunique(),  # number of countries
            weights=country_weights,
            replace=True,
            random_state=rng,
        )
        for s in range(M)
    },
    names=['sample', 'Country'],
).reset_index('sample')

(4) Sample M fruits per country (on average) inversely proportional to the number of occurrences of the fruit, but bootstrap the countries:

df4 = df.join(fruit_weights, on='Fruit')

# normalize fruit weights to sum to 1 per country to not affect the country weights
df4['fruit_weights'] = df4.fruit_weights.groupby('Country').transform(lambda x: x / x.sum())

df4 = df4.join(country_weights, on='Country')

weight_cols = [c for c in df4.columns if '_weights' in c]
weights = df4[weight_cols].prod(axis=1)
df4 = df4.drop(columns=weight_cols)
sampled_fruits = pd.concat(
    {
        s: df4.sample(
            n=df.index.nunique(),  # number of countries
            weights=weights,
            replace=True,
            random_state=rng,
        )
        for s in range(M)
    },
    names=['sample', 'Country'],
).reset_index('sample')

Number (4) almost accomplishes what I want. The countries and fruits are sampled independently uniformly at random.
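
The uniformity claim can be checked by summing the row weights per country, since `DataFrame.sample` normalizes `weights` and draws each row with probability proportional to its weight. The sketch below is only an illustration on a hypothetical two-country table (the names `toy` and `country_marginals` are made up here). On that table, multiplying the per-country-normalized fruit weights by the 1/size country weights seems to favour the smaller country, whereas the per-country-normalized fruit weights alone already give every country equal mass.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: country A has one row, country B has three.
toy = pd.DataFrame(
    {'Fruit': ['Apple', 'Apple', 'Banana', 'Cherry']},
    index=pd.Index(['A', 'B', 'B', 'B'], name='Country'),
)

# Same weight construction as in the setup of the question.
fruit_w = 1 / toy.groupby('Fruit').size()
fruit_w /= fruit_w.sum()
country_w = 1 / toy.groupby('Country').size()
country_w /= country_w.sum()

# Per-row fruit weights, normalized to sum to 1 within each country.
row_fruit_w = toy['Fruit'].map(fruit_w)
row_fruit_w = row_fruit_w / row_fruit_w.groupby(level='Country').transform('sum')

# Per-row country weights (one value per row of the table).
row_country_w = country_w.reindex(toy.index).to_numpy()


def country_marginals(row_weights):
    """Probability of each country when one row is drawn with these weights."""
    p = pd.Series(row_weights, index=toy.index).groupby(level='Country').sum()
    return p / p.sum()


# Weights built as in strategy (4): fruit factor times country factor.
with_country_factor = country_marginals(row_fruit_w.to_numpy() * row_country_w)
# Fruit factor only: each country's rows already sum to 1.
fruit_factor_only = country_marginals(row_fruit_w.to_numpy())

print(with_country_factor)  # A is about 0.75, B about 0.25
print(fruit_factor_only)    # A and B are both 0.5
```

If the same pattern holds on the real data, dropping the country factor in (4) (or not normalizing the fruit weights per country) would be one way to restore uniform country sampling.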

There is only one issue:

Assume now that I also want to sample vegetables and then (somehow) compare the results to the results from the fruits. Assume that the countries remain the same, but the number of different vegetables is not equal to the number of different fruits, neither overall, nor for a given country (at least not for all countries).

This will result in different sets of countries being sampled for fruits and vegetables for any given bootstrap iteration m. To clarify, for each bootstrap iteration m, the sampled countries should be identical for fruits and vegetables, i.e.

for m in range(M):
    assert all(sampled_fruits[sampled_fruits['sample'] == m].index == sampled_vegetables[sampled_vegetables['sample'] == m].index)

(I know how to achieve the result I want using nested for-loops, sampling a country followed by a fruit/vegetable, but this is something I want to avoid.)
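
One possible loop-free shape for this is to draw the countries once per iteration and then pick one row per drawn country in each table with a vectorized cumulative-weight lookup. This is a minimal sketch, not a tested solution for the real data: `pick_rows`, the weight column `w`, and the toy `fruits`/`vegetables` tables are hypothetical, and it assumes every country appears in both tables.

```python
import numpy as np
import pandas as pd


def pick_rows(table, drawn, weight_col, rng):
    """Pick one row of `table` per country label in `drawn`, with probability
    proportional to `weight_col` within that country (hypothetical helper)."""
    table = table.sort_index(kind='stable')   # rows of a country become contiguous
    sizes = table.groupby(level=0).size()     # sorted by country, like `table`
    end = sizes.to_numpy().cumsum()           # one past the last row of each country
    start = end - sizes.to_numpy()            # first row of each country
    # cum[i] is the total weight of all rows before row i
    cum = np.concatenate(([0.0], table[weight_col].to_numpy(dtype=float).cumsum()))
    group = sizes.index.get_indexer(drawn)    # position of each drawn country
    lo, hi = cum[start[group]], cum[end[group]]
    target = lo + rng.random(len(drawn)) * (hi - lo)
    pos = np.searchsorted(cum, target, side='right') - 1
    pos = np.clip(pos, start[group], end[group] - 1)  # guard against round-off
    return table.iloc[pos]


# Toy tables (assumed data); 'w' holds the within-country row weights.
fruits = pd.DataFrame(
    {'Fruit': ['Pineapple', 'Apple', 'Watermelon', 'Banana', 'Apple'], 'w': 1.0},
    index=pd.Index(['USA', 'USA', 'Canada', 'Canada', 'Canada'], name='Country'),
)
vegetables = pd.DataFrame(
    {'Vegetable': ['Carrot', 'Leek', 'Kale', 'Carrot'], 'w': 1.0},
    index=pd.Index(['USA', 'Canada', 'Canada', 'Canada'], name='Country'),
)

M = 10
rng = np.random.default_rng(seed=123)
countries = fruits.index.unique().to_numpy()
# One shared, uniform country draw per bootstrap iteration (one row per iteration).
drawn = rng.choice(countries, size=(M, len(countries)))
sample_id = np.repeat(np.arange(M), len(countries))

sampled_fruits = pick_rows(fruits, drawn.ravel(), 'w', rng).assign(sample=sample_id)
sampled_vegetables = pick_rows(vegetables, drawn.ravel(), 'w', rng).assign(sample=sample_id)
```

Because both calls receive the same `drawn` array, the country index of the two results matches row by row, which is what the assertion above asks for. Replacing the constant `w` with inverse-frequency weights would change only the within-country choice, not the country draw.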


(The fruits and vegetables are just random things chosen to illustrate my question. In my real use case, the countries are samples in a test set, and the fruits and vegetables are two different groups of humans, where each human has assessed / made predictions on a subset of the test set.)


Nick ODell
Filip
  • I can think of a way which doesn't involve a for loop, which is to construct a probability matrix which specifies for each country the probability of choosing a fruit, then use np.random.randint to index into the country axis of that matrix, then do a [vectorized weighted selection](https://stackoverflow.com/questions/34187130/fast-random-weighted-selection-across-all-rows-of-a-stochastic-matrix) on the resulting matrix, but it's not clear that would actually be faster than a for loop. Would depend on the cardinality of fruits/vegetables vs countries. – Nick ODell May 04 '23 at 03:07
  • Some suggestions on how to make this question easier to answer: 1) You know how to do the fruit/vegetable sampling with a for loop - why not show it? 2) If this is a performance concern, then make that an explicit criterion. Benchmark your slow solution and invite people to improve on it. There may be a better solution involving a for loop than one without. – Nick ODell May 04 '23 at 03:12
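
The probability-matrix idea from the first comment might look roughly like the sketch below. It is an illustration under assumptions, not the commenter's code: the toy `table` is made up, the per-country probabilities are simply row frequencies, and `rng.integers` stands in for the `np.random.randint` mentioned in the comment.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Toy data (assumed); Country is a regular column here.
table = pd.DataFrame({
    'Country': ['USA', 'USA', 'Canada', 'Canada', 'Canada'],
    'Fruit': ['Pineapple', 'Apple', 'Watermelon', 'Banana', 'Apple'],
})

# Rows = countries, columns = fruits, entries = P(fruit | country).
counts = pd.crosstab(table['Country'], table['Fruit'])
prob = counts.div(counts.sum(axis=1), axis=0).to_numpy()

n_draws = 1000
# Uniform country draws, used to index the country axis of the matrix.
country_idx = rng.integers(0, prob.shape[0], size=n_draws)

# Vectorized weighted selection: one inverse-CDF lookup per drawn row.
cdf = prob[country_idx].cumsum(axis=1)
# Scaling by the row total guards against sums that fall slightly short of 1.
u = rng.random(n_draws) * cdf[:, -1]
fruit_idx = (cdf > u[:, None]).argmax(axis=1)

sampled = pd.DataFrame({
    'Country': counts.index[country_idx],
    'Fruit': counts.columns[fruit_idx],
})
```

The intermediate `cdf` array has one row per draw and one column per fruit, so memory grows with both, which is the cardinality trade-off the comment points out. The same `country_idx` could index a second matrix built for vegetables, assuming its rows are in the same country order.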
