
I have a large pandas dataframe with about 10,000,000 rows. Each row represents a feature vector. The feature vectors come in natural groups, and the group label is in a column called group_id. I would like to randomly sample, say, 10% of the rows, but in proportion to the numbers of each group_id.

For example, if the group_id's are A, B, A, C, A, B then I would like half of my sampled rows to have group_id A, two sixths to have group_id B and one sixth to have group_id C.

I can see the pandas function sample but I am not sure how to use it to achieve this goal.

Simd

4 Answers


You can use `groupby` and `sample`:

sample_df = df.groupby('group_id').apply(lambda x: x.sample(frac=0.1))
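If your pandas is 1.1 or newer, `GroupBy.sample` does the same stratified draw without the `apply`. A quick sketch on toy data (the column names here are made up, not from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'group_id': ['A', 'B', 'A', 'C', 'A', 'B'] * 100,  # 300 A, 200 B, 100 C
    'val': range(600),
})

# draw 10% from every group, so the group proportions are preserved
sample_df = df.groupby('group_id').sample(frac=0.1, random_state=0)

# 60 rows in total: 30 A, 20 B, 10 C
print(sample_df['group_id'].value_counts().sort_index())
```

Unlike the `apply` version, `GroupBy.sample` keeps a flat index rather than prepending `group_id` as an extra index level.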
Vaishali
    Let me just remark that with this code it might happen that some groups don't appear in the result at all. Specifically, this happens if the fraction times the size of a group is less than one (even if it's 0.9). So with a fraction of 10% you will only obtain samples from groups that contain at least 10 elements. This might not be an issue, though – Quickbeam2k1 Sep 03 '17 at 19:23
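If vanishing small groups are a problem, one workaround (my sketch, not from the thread) is to clamp each group's draw to at least one row:

```python
import pandas as pd

df = pd.DataFrame({
    'group_id': ['A'] * 95 + ['B'] * 5,  # group B has only 5 rows
    'val': range(100),
})

# a plain frac=0.1 would round 5 * 0.1 = 0.5 down to 0 rows for group B;
# clamping with max(1, ...) keeps every group represented
sample_df = df.groupby('group_id', group_keys=False).apply(
    lambda x: x.sample(n=max(1, round(len(x) * 0.1))))

print(sample_df['group_id'].value_counts())  # A: 10, B: 1
```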

The following samples a total of N rows, where each group appears in its original proportion to the nearest integer, then shuffles and resets the index, using:

import numpy as np
import pandas as pd

N = 10  # total number of rows to sample

df = pd.DataFrame(dict(
    A=[1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4],
    B=range(20)
))

Short and sweet:

df.sample(n=N, weights='A', random_state=1).reset_index(drop=True)

Long version

df.groupby('A', group_keys=False).apply(lambda x: x.sample(int(np.rint(N*len(x)/len(df))))).sample(frac=1).reset_index(drop=True)
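As a quick check of the long version, here it is run end to end with `N = 10` (a value picked for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    A=[1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4],
    B=range(20)
))
N = 10  # total number of rows to sample

out = (df.groupby('A', group_keys=False)
         .apply(lambda x: x.sample(int(np.rint(N * len(x) / len(df)))))
         .sample(frac=1)
         .reset_index(drop=True))

# the group sizes are 7, 6, 2 and 5 out of 20, so with N = 10 the
# per-group draws round to 4, 3, 1 and 2 rows respectively
print(out['A'].value_counts().sort_index())
```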
irkinosor

I was looking for a similar solution. The code provided by @Vaishali works absolutely fine. What @Abdou is trying to do also makes sense when we want to extract samples from each group based on their proportions to the full data.

# original : 10% from each group
sample_df = df.groupby('group_id').apply(lambda x: x.sample(frac=0.1))

# modified : sample size based on proportions of group size
n = df.shape[0]
sample_df = df.groupby('group_id').apply(lambda x: x.sample(frac=len(x)/n))
  • Notice that you may want to replace `length` with `len` (as there is no `length` method in python). I tried to edit the post myself but the ridiculous rule that editing must be of at least 6 characters prevented me to do so :). – gented Sep 24 '19 at 23:08

This is not as simple as just grouping and using .sample. You need to get the fractions first. Since you said that you are looking to grab 10% of the total number of rows in different proportions, you will need to calculate how many rows each group has to contribute to the main sample. For instance, if we use the divide you mentioned in the question, then group A will end up with 1/20 as its fraction of the total number of rows, group B will get 1/30, and group C ends up with 1/60. You can put these fractions in a dictionary and then use .groupby and pd.concat to concatenate the sampled rows from each group into a dataframe. You will be using the n parameter of the .sample method instead of the frac parameter.

fracs = {'A': 1/20, 'B': 1/30, 'C': 1/60}
N = len(df)
pd.concat(dff.sample(n=int(fracs.get(i)*N)) for i,dff in df.groupby('group_id'))

Edit:

This is to highlight the importance in fulfilling the requirement that group_id A should have half of the sampled rows, group_id B two sixths of the sampled rows and group_id C one sixth of the sampled rows, regardless of the original group divides.

Starting with equal portions: each group starts with 40 rows

df1 = pd.DataFrame({'group_id': ['A','B', 'C']*40,
                   'vals': np.random.randn(120)})
N = len(df1)
fracs = {'A': 1/20, 'B': 1/30, 'C': 1/60}
print(pd.concat(dff.sample(n=int(fracs.get(i) * N)) for i,dff in df1.groupby('group_id')))

#     group_id      vals
# 12         A -0.175109
# 51         A -1.936231
# 81         A  2.057427
# 111        A  0.851301
# 114        A  0.669910
# 60         A  1.226954
# 73         B -0.166516
# 82         B  0.662789
# 94         B -0.863640
# 31         B  0.188097
# 101        C  1.802802
# 53         C  0.696984


print(df1.groupby('group_id').apply(lambda x: x.sample(frac=0.1)))

#              group_id      vals
# group_id
# A        24         A  0.161328
#          21         A -1.399320
#          30         A -0.115725
#          114        A  0.669910
# B        34         B -0.348558
#          7          B -0.855432
#          106        B -1.163899
#          79         B  0.532049
# C        65         C -2.836438
#          95         C  1.701192
#          80         C -0.421549
#          74         C -1.089400

First solution: 6 rows for group A (1/2 of the sampled rows), 4 rows for group B (one third of the sampled rows) and 2 rows for group C (one sixth of the sampled rows).

Second solution: 4 rows for each group (each one third of the sampled rows)


Working with differently sized groups: 40 for A, 60 for B and 20 for C

df2 = pd.DataFrame({'group_id': np.repeat(['A', 'B', 'C'], (40, 60, 20)),
                   'vals': np.random.randn(120)})
N = len(df2)
print(pd.concat(dff.sample(n=int(fracs.get(i) * N)) for i,dff in df2.groupby('group_id')))

#     group_id      vals
# 29         A  0.306738
# 35         A  1.785479
# 21         A -0.119405
# 4          A  2.579824
# 5          A  1.138887
# 11         A  0.566093
# 80         B  1.207676
# 41         B -0.577513
# 44         B  0.286967
# 77         B  0.402427
# 103        C -1.760442
# 114        C  0.717776

print(df2.groupby('group_id').apply(lambda x: x.sample(frac=0.1)))

#              group_id      vals
# group_id
# A        4          A  2.579824
#          32         A  0.451882
#          5          A  1.138887
#          17         A -0.614331
# B        47         B -0.308123
#          52         B -1.504321
#          42         B -0.547335
#          84         B -1.398953
#          61         B  1.679014
#          66         B  0.546688
# C        105        C  0.988320
#          107        C  0.698790

First solution: consistent. Second solution: group B has now taken 6 of the sampled rows when it is supposed to take only 4.


Working with another set of differently sized groups: 60 for A, 40 for B and 20 for C

df3 = pd.DataFrame({'group_id': np.repeat(['A', 'B', 'C'], (60, 40, 20)),
                   'vals': np.random.randn(120)})
N = len(df3)
print(pd.concat(dff.sample(n=int(fracs.get(i) * N)) for i,dff in df3.groupby('group_id')))

#     group_id      vals
# 48         A  1.214525
# 19         A -0.237562
# 0          A  3.385037
# 11         A  1.948405
# 8          A  0.696629
# 39         A -0.422851
# 62         B  1.669020
# 94         B  0.037814
# 67         B  0.627173
# 93         B  0.696366
# 104        C  0.616140
# 113        C  0.577033

print(df3.groupby('group_id').apply(lambda x: x.sample(frac=0.1)))

#              group_id      vals
# group_id
# A        4          A  0.284448
#          11         A  1.948405
#          8          A  0.696629
#          0          A  3.385037
#          31         A  0.579405
#          24         A -0.309709
# B        70         B -0.480442
#          69         B -0.317613
#          96         B -0.930522
#          80         B -1.184937
# C        101        C  0.420421
#          106        C  0.058900

This is the only time the second solution offered some consistency (out of sheer luck, I might add).

I hope this proves useful.

Abdou
  • so the base split is `1/2, 1/3, 1/6` and you multiply by `1/10`, which gives your numbers. But you are suggesting that in group A you would only take 1/20 of the elements. So, say we have 60 elements in group A, 40 in B and 20 in C, so 120 in total. Sampling 10% overall shall give 12 elements; sampling 10% of each group gives 6, 4 and 2 (which sum up to 12). Your method gives 3, 2, and 1 elements, which are only 6 elements in total, or overall 5% – Quickbeam2k1 Sep 03 '17 at 19:54
  • I think I forgot to add that I was using the fractions to get the actual numbers of rows. So group `A` would take `(1/20) * len(df)`. Instead of using `frac`, it should use the `n` parameter. – Abdou Sep 03 '17 at 19:58
  • but where is the difference to the other solution? I don't see where you solution behaves differently than Vaishali's solution except for more unnecessary code. – Quickbeam2k1 Sep 03 '17 at 20:17
  • In mine, group A is always guaranteed to take up half of the resulting dataframe while groups B and C grab one third and one sixth, respectively. I doubt that that is guaranteed with the other solution. I could be wrong. – Abdou Sep 03 '17 at 20:21
  • If the group fractions are a bit odd, your guarantee of the same fraction for every group won't hold. Besides, the same guarantee holds for the other answer. Doing the math by hand, you'll see that there is no difference (12 = 120*0.1 = (60+40+20)*0.1 = (6+4+2) = 1/20*120 + 1/30*120 + 1/60*120 = 120 * 0.1 * (1/2 + 1/3 + 1/6)) – Quickbeam2k1 Sep 03 '17 at 20:29
  • What if you started with 40 for each group? You don't get the guarantee when you use the other solution in that case. You can check for yourself. Generate data and test this. Also, please test those *fractions that are a bit odd* when you get a chance. – Abdou Sep 03 '17 at 20:38
  • if I have 40 elements per group, the only correct solution is to have 4 elements from every group (and I checked it). This is obtained by @Vaishali's solution. Additionally, as said before, the math is clear. Or could you provide context on where your solution provides a better result? – Quickbeam2k1 Sep 03 '17 at 20:53
  • Each group gets 4 and you think 4 is both one sixth of 12 and two sixths of 12? The sampled rows length is 12, then you want 6 for group A, 4 for group B and 2 for group C. – Abdou Sep 03 '17 at 21:06
  • @Quickbeam2k1 when you get a chance, please review my edits and let me know if I am being any clearer. – Abdou Sep 03 '17 at 21:25
    The criterion that the OP has specified about A being half, B 2/6 and C 1/6 is because he is giving an example of group_id being A, B, A, C, A, B, in which case clearly A is 3/6 (half), B 2/6 and C 1/6. The main idea is to sample 10% of the rows but in proportion to the numbers of each group_id. In your sample df1 the proportions of A, B and C are the same, and hence the solution I proposed selected 4 rows each, and so on.... – Vaishali Sep 03 '17 at 23:37
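For what it's worth, the arithmetic in this comment thread is easy to check in code (my sketch; the `random_state` values are arbitrary, and I round before truncating to dodge float edge cases). With groups of 60/40/20 the two answers agree on 6/4/2 sampled rows; they only diverge when the actual group proportions differ from the fixed 1/2, 1/3, 1/6 split:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'group_id': np.repeat(['A', 'B', 'C'], (60, 40, 20))})
N = len(df)
fracs = {'A': 1/20, 'B': 1/30, 'C': 1/60}

# Abdou's fixed-fraction approach: always 1/2, 1/3, 1/6 of the sample
fixed = pd.concat(dff.sample(n=int(round(fracs[g] * N)), random_state=1)
                  for g, dff in df.groupby('group_id'))

# Vaishali's approach: 10% of each group, whatever its size
prop = df.groupby('group_id', group_keys=False).apply(
    lambda x: x.sample(frac=0.1, random_state=1))

# both give 6 A, 4 B and 2 C rows here
print(fixed['group_id'].value_counts().sort_index())
print(prop['group_id'].value_counts().sort_index())
```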