Creating multiple subsets from single data frame, without replacement

Question

I am trying to create 10 different subsets of 5 Members without replacement from this data (in Python):

      Member CIN Needs Assessment Network Enrolled
117   CS38976K                1                1
118   GN31829N                1                1
119   GD98216H                1                1
120   VJ71307A                1                1
121   OX22563R                1                1
122   YW35494W                1                1
123   QX20765B                1                1
124   NO50548K                1                1
125   VX90647K                1                1
126   RG21661H                1                1
127   IT17216C                1                1
128   LD81088I                1                1
129   UZ49716O                1                1
130   UA16736M                1                1
131   GN07797S                1                1
132   TN64827F                1                1
133   MZ23779M                1                1
134   UG76487P                1                1
135   CY90885V                1                1
136   NZ74233H                1                1
137   CB59280X                1                1
138   LI89002Q                1                1
139   LO64230I                1                1
140   NY27508Q                1                1
141   GU30027P                1                1
142   XJ75065T                1                1
143   OW40240P                1                1
144   JQ23187C                1                1
145   PQ45586F                1                1
146   IM59460P                1                1
147   OU17576V                1                1
148   KL75129O                1                1
149   XI38543M                1                1
150   PO09602E                1                1
151   PS27561N                1                1
152   PC63391R                1                1
153   WR70847S                1                1
154   XL19132L                1                1
155   ZX27683R                1                1
156   MZ63663M                1                1
157   FT35723P                1                1
158   NX90823W                1                1
159   SC16809F                1                1
160   TX83955R                1                1
161   JA79273O                1                1
162   SK66781D                1                1
163   UK69813N                1                1
164   CX01143B                1                1
165   MT45485A                1                1
166   LJ25921O                1                1

I tried many variations of random.sample() inside a for _ in range() loop, but nothing works, and nothing I have found so far on Stack Overflow gives me the result I need.

Hi! What exactly is the result you need? Could you clarify it, please? - Valentino, Apr 22 '19 at 14:02
Hi, there are 50 members in this master data frame. I am trying to use these fake members for a rolling 10-month attribution to a program, and I need 5 unique members each month: 5 unique IDs in the first subset, then 5 unique IDs in the next, and so on, all drawn from this master data frame. - TP89, Apr 22 '19 at 14:05
Should the IDs be unique only within each subset? In other words, can the same ID appear in two different subsets in different months? - Valentino, Apr 22 '19 at 14:07
Each ID can only appear in one subset, never repeated again. - TP89, Apr 22 '19 at 14:10

score 2 · Accepted Answer · answered Apr 22 '19 at 14:31

Here is a solution using pandas.

Assuming master is your master dataframe created with pandas, you can do:

shuffled = master.sample(frac=1)

This creates a copy of your master dataframe with its rows randomly reordered. See this answer on Stack Overflow or the docs for the sample method.
Then you can simply build 10 smaller dataframes of five rows each by slicing the shuffled dataframe in order.

subsets = []
for i in range(10):
    subdf = shuffled.iloc[(i*5):(i+1)*5]
    subsets.append(subdf)

subsets is the list containing your small dataframes. Do:

for sub in subsets:
    print(sub)

to print them all and verify by eye that there are no repetitions.
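
For more than a handful of subsets, checking by eye gets tedious. The following is a minimal sketch of the same shuffle-and-slice approach with a programmatic check. The dataframe here is a hypothetical stand-in with made-up IDs (not the data from the question), and random_state is an optional argument used only to make the run repeatable.

```python
import pandas as pd

# Hypothetical stand-in for the master dataframe: 50 made-up member IDs.
master = pd.DataFrame({
    "Member CIN": [f"ID{n:05d}" for n in range(50)],
    "Needs Assessment": 1,
    "Network Enrolled": 1,
})

# Shuffle all rows; random_state is optional and only makes the run repeatable.
shuffled = master.sample(frac=1, random_state=0)

# Slice the shuffled frame into 10 consecutive blocks of 5 rows.
subsets = [shuffled.iloc[i * 5:(i + 1) * 5] for i in range(10)]

# Stack the subsets back together and check that no ID appears twice
# and that every row of master was used exactly once.
all_ids = pd.concat(subsets)["Member CIN"]
no_repeats = all_ids.is_unique and len(all_ids) == len(master)
print(no_repeats)
```

Because every subset is a slice of the same shuffled frame, the subsets cannot overlap as long as the IDs in master are themselves unique.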

score 1 · Answer 2 · answered Apr 22 '19 at 14:04

This seems like a combination problem. Here is a solution: You should create your list, say L. Then you decide the size of the subset, say r. After that here is the code:

from itertools import combinations

combinations(L, r)

However, if you don't want to decide the size of the subset yourself, you can use the random module as follows:

import random
from itertools import combinations

combinations(L, random.randint(a, b))

In this case, r is a random integer between a and b (inclusive), and combinations yields every possible subset of r elements from the list L, so you would still need to pick one of them. If you want to do that 10 times, you can use a for loop.

I hope that works for you.
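
Note that combinations(L, r) enumerates all r-element subsets and does not by itself guarantee that two chosen subsets share no elements. As a hedged alternative using only the standard library, one could shuffle a copy of the list and cut it into consecutive slices. The list contents and the names pool and n_subsets below are illustrative assumptions, not part of the original answer.

```python
import random

# Hypothetical list of 50 made-up IDs standing in for L.
L = [f"ID{n:05d}" for n in range(50)]
r = 5            # size of each subset
n_subsets = 10   # number of subsets to build

pool = L[:]            # copy so the original list is left untouched
random.shuffle(pool)   # random reordering, done in place

# Consecutive slices of the shuffled list cannot share elements.
subsets = [pool[i * r:(i + 1) * r] for i in range(n_subsets)]

for subset in subsets:
    print(subset)
```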

score 1 · Answer 3 · answered Apr 22 '19 at 14:17

Let's assume that the variable lines holds the rows of your dataset as a list (random.sample needs a sequence, not a bare iterator). Then:

from random import sample

# Chunk length
chunk_len = 2

# Number of chunks
num_of_chunks = 5

# Get the sample with data for all chunks. It guarantees us that there will
# be no repetitions
random_sample = sample(lines, num_of_chunks*chunk_len)

# Construct the list with chunks
result = [random_sample[i::num_of_chunks] for i in range(num_of_chunks)]
result

This will return something like the following (the exact rows vary from run to run):

[['123   QX20765B                1                1',
  '118   GN31829N                1                1'],
 ['127   IT17216C                1                1',
  '122   YW35494W                1                1'],
 ['138   LI89002Q                1                1',
  '126   RG21661H                1                1'],
 ['120   VJ71307A                1                1',
  '121   OX22563R                1                1'],
 ['143   OW40240P                1                1',
  '142   XJ75065T                1                1']]
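
The same idea can be scaled to the sizes asked for in the question (10 chunks of 5) and applied to a pandas dataframe by sampling row labels instead of text lines. This is only a sketch: the dataframe below is a hypothetical stand-in with made-up IDs, and the names labels and chunks are not from the original answer.

```python
from random import sample

import pandas as pd

# Hypothetical stand-in for the master dataframe, indexed 117..166
# like the rows shown in the question.
master = pd.DataFrame(
    {"Member CIN": [f"ID{n:05d}" for n in range(50)]},
    index=range(117, 167),
)

chunk_len = 5        # members per subset
num_of_chunks = 10   # number of subsets

# Draw row labels without replacement, so no label can appear twice.
labels = sample(list(master.index), num_of_chunks * chunk_len)

# Deal the labels out round-robin: chunk i gets labels i, i+10, i+20, ...
chunks = [master.loc[labels[i::num_of_chunks]] for i in range(num_of_chunks)]

for chunk in chunks:
    print(chunk)
```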
