15

I have a pandas DataFrame which looks approximately as follows:

cli_id | X1 | X2 | X3 | ... | Xn |  Y  |
----------------------------------------
123    | 1  | A  | XX | ... | 4  | 0.1 |
456    | 2  | B  | XY | ... | 5  | 0.2 |
789    | 1  | B  | XY | ... | 5  | 0.3 |
101    | 2  | A  | XX | ... | 4  | 0.1 |
...

I have a client id, a few categorical attributes, and Y, which is the probability of an event, taking values from 0 to 1 in steps of 0.1.

I need to take a stratified sample of size 200 from every group of Y (so 10 folds).

I often use this to take a stratified sample when splitting into train/test:

from sklearn.cross_validation import StratifiedShuffleSplit

def stratifiedSplit(X, y, size):
    # one stratified shuffle split; the test fold has `size` rows
    # (or a fraction, if `size` is a float) and preserves the
    # class proportions of y
    sss = StratifiedShuffleSplit(y, n_iter=1, test_size=size, random_state=0)

    for train_index, test_index in sss:
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    return X_train, X_test, y_train, y_test
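
For context, a minimal usage sketch; it assumes the pre-0.18 sklearn.cross_validation API that this signature matches, and that df holds the table above:

X = df.drop(['cli_id', 'Y'], axis=1)
y = df['Y']

# hold out 200 rows, stratified on the values of y
X_train, X_test, y_train, y_test = stratifiedSplit(X, y, 200)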

But I don't know how to modify it in this case.

HonzaB

2 Answers

34

If the number of samples is the same for every group, or if the proportion is constant for every group, you could try something like

df.groupby('Y').apply(lambda x: x.sample(n=200))

or

df.groupby('Y').apply(lambda x: x.sample(frac=.1))

To perform stratified sampling with respect to more than one variable, just group with respect to more variables. It may be necessary to construct new binned variables to this end.
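
For instance, a sketch along those lines, assuming X1 is numeric and using an illustrative helper column X1_bin:

import pandas as pd

# bin X1 into 5 intervals, then stratify on (Y, X1_bin) jointly
df['X1_bin'] = pd.cut(df['X1'], bins=5)
sampled = df.groupby(['Y', 'X1_bin']).apply(lambda x: x.sample(frac=.1))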

However, if the group size is too small w.r.t. the proportion (e.g., group size 1 and proportion .25), then no item will be returned. This is due to Python's truncating conversion to int: int(0.25) == 0.
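
A tiny illustration of that pitfall, on toy data:

import pandas as pd

tiny = pd.DataFrame({'Y': [0.1, 0.2], 'X1': [1, 2]})   # every group has a single row
print(tiny.groupby('Y').apply(lambda x: x.sample(frac=.25)))   # empty result: 1 * .25 rounds down to 0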

Quickbeam2k1
  • Let's say I have a DataFrame with 100,000 rows and I want to sample 10,000 from it, but with a minimum of 10 samples from each group; how would you approach this? With your code I get 10 samples from each group, but that results in a 70k sample – joddm Mar 08 '19 at 12:41
  • This is a different problem, as you are not doing random samples per group. What you could do: use my approach to sample the required 10 per group. Then do random sampling on the rest of all of the data and fill up to 10k records. – Quickbeam2k1 Mar 08 '19 at 13:08
  • When I use your approach I get 70k samples. I want to reduce this to 10k while keeping at least 10 samples from each remaining group – joddm Mar 08 '19 at 13:33
  • Instead of frac you can just write e.g. n=10. Then you'd receive 10 samples from each group. Could you maybe create a new question and link to this one? – Quickbeam2k1 Mar 08 '19 at 13:37
  • If you want to have a normal DataFrame after this command (and not MultiIndex), execute: `df_test = df_stratified.droplevel(level=0)`. Then you can use the indices to get the train split: `df_train = df[~df.index.isin(df_test.index)]` – NumesSanguis Jun 29 '20 at 06:46
  • why are you using a lambda? df.groupby('Y').sample(frac=.1) also works! – G. Macia Jun 22 '22 at 14:03
  • My edited answer is from 2018. `sample` on a `GroupBy` object is only available since pandas 1.1.0 released in 2020. – Quickbeam2k1 Jun 22 '22 at 17:58
4

I'm not totally sure whether you mean this:

strats = []
for k in range(11):
    y_val = round(k * 0.1, 1)   # round to dodge float artifacts (e.g. 3 * 0.1 != 0.3)

    # keep only the rows with this value of Y, then draw 200 of them
    dummy_df = your_df[your_df['Y'] == y_val]
    strats.append(dummy_df.sample(200))

That makes a dummy dataframe consisting of only the Y value you want, and then takes a sample of 200 from it.

OK, so you need the different chunks to have the same structure. I guess that's a bit harder; here's how I would do it:

First of all, I would get a histogram of what X1 looks like:

import numpy as np

min_x, max_x = your_df['X1'].min(), your_df['X1'].max()
hist, edges = np.histogram(your_df['X1'], bins=np.linspace(min_x, max_x, nbins + 1))

We now have a histogram with nbins bins.

Now the strategy is to draw a certain number of rows depending on their value of X1. We will draw more from the bins with more observations and fewer from the bins with fewer, so that the structure of X1 is preserved.

In particular, the relative contribution of every bin should be:

rel = [float(i) / sum(hist) for i in hist]

This will be something like [0.1, 0.2, 0.1, 0.3, 0.3]

If we want 200 samples, we need to draw:

draws_in_bin = [int(i*200) for i in rel]
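
Note that int() truncates, so these counts can add up to slightly fewer than 200. A small illustrative fix (not part of the original answer) is to top up the largest bins:

shortfall = 200 - sum(draws_in_bin)
# hand one extra draw to each of the `shortfall` largest bins
for i in sorted(range(len(rel)), key=rel.__getitem__, reverse=True)[:shortfall]:
    draws_in_bin[i] += 1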

Now we know how many observations to draw from every bin:

import pandas as pd

strats = []
for k in range(11):
    y_val = round(k * 0.1, 1)

    # get a dataframe for every value of Y
    dummy_df = your_df[your_df['Y'] == y_val]

    bin_strat = []
    for left_edge, right_edge, n_draws in zip(edges[:-1], edges[1:], draws_in_bin):

        # half-open interval [left_edge, right_edge), as in np.histogram
        bin_df = dummy_df[(dummy_df['X1'] >= left_edge)
                          & (dummy_df['X1'] < right_edge)]

        bin_strat.append(bin_df.sample(n_draws))
        # this takes the right number of draws out
        # of the X1 bin where we currently are.
        # Note that every element of bin_strat is a dataframe
        # with a number of entries that corresponds to the
        # structure of draws_in_bin

    # concatenate the dataframes for every bin and append to the list
    strats.append(pd.concat(bin_strat))
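
And if one frame is more convenient than a list of folds, a final usage sketch:

stratified_sample = pd.concat(strats)   # all folds in a single dataframe
print(stratified_sample.shape)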
elelias
  • Ok, this splits the DataFrame into 11 folds and fills each one with 200 rows in a random way. That is one part of my goal. The second is to have those folds stratified, e.g. X1 will have approximately the same structure in each fold. – HonzaB Dec 08 '16 at 09:00
  • Actually I think this is all I need. I will relax the constraint of stratification and work with random sample. – HonzaB Dec 08 '16 at 09:23
  • ah! just replied with how I'd go about it. Take a look if you are interested. It's not very direct or elegant, but I guess it should work. – elelias Dec 08 '16 at 09:29
  • I like the idea how you solved the stratification. Very useful. Thank you! – HonzaB Dec 08 '16 at 09:31
  • In general, if `X1` and `Y` are uncorrelated, you should have stratified results by drawing randomly. With only 200 samples though, well, you will probably observe differences. – elelias Dec 08 '16 at 09:35