0

I have a dataframe in pandas that looks like this:

              AID       VID  Freq
0        00016A3E  0127C661     1
1        00016A3E  0C05DA5D     2
2        00016A3E  0C032814     1
3        00016A3E  0BF6C78D     1
4        00016A3E  0A79DFF1     1
5        00016A3E  07BD2FB2     1
6        00016A3E  0790E61B     1
7        00016A3E  0C24ED25     3
8        00016A3E  073630B5     3
9        00016A3E  06613535     1
10       00016A3E  05F809AF     1
11       00016A3E  05C625FF     1
12       00016A3E  04220EA8     4
13       00016A3E  013A29E5     1
14       00016A3E  0761C98A     1
15       00016AE9  0A769475    16
16       00016AE9  0A7DED0A     2
17       00016AE9  0ABF60DF     9
18       00016AFF  0AE3F25A     2
19       00016AFF  0AEFE12F     5
20       00016AFF  0BD8975A     2
21       00016AFF  44DF880B     1
22       00016AFF  43F9E08E     2
23       00016AFF  44EA5E08     2
24       00016AFF  4539ED1E    16
25       00016AFF  8516B55A     4
26       00016AFF  0972AFF2     1
27       00016AFF  0C559B34     1
28       00016AFF  06B5C040     7
29       00016AFF  0B0426FA     1

I want to:

1- Extract all rows where the AID occurs 10 times or more, let's call it df1.

2- From that df1 I want to randomly select 2 rows for each AID and remove them from the df1 into another dataframe, let's call it df2.

So the resulting df's should look like this:

df1: We remove the rows where AID is 00016AE9, since there are only 3 occurrences of it.

            AID       VID  Freq
0        00016A3E  0127C661     1
1        00016A3E  0C05DA5D     2
2        00016A3E  0C032814     1
4        00016A3E  0A79DFF1     1
5        00016A3E  07BD2FB2     1
7        00016A3E  0C24ED25     3
8        00016A3E  073630B5     3
9        00016A3E  06613535     1
10       00016A3E  05F809AF     1
11       00016A3E  05C625FF     1
12       00016A3E  04220EA8     4
13       00016A3E  013A29E5     1
14       00016A3E  0761C98A     1
18       00016AFF  0AE3F25A     2
19       00016AFF  0AEFE12F     5
20       00016AFF  0BD8975A     2
21       00016AFF  44DF880B     1
22       00016AFF  43F9E08E     2
23       00016AFF  44EA5E08     2
24       00016AFF  4539ED1E    16
25       00016AFF  8516B55A     4
26       00016AFF  0972AFF2     1    
29       00016AFF  0B0426FA     1

df2: I want to randomly select two rows from each set of AIDs and put it in df2:

              AID       VID  Freq
3        00016A3E  0BF6C78D     1
6        00016A3E  0790E61B     1
27       00016AFF  0C559B34     1
28       00016AFF  06B5C040     7
BKS
  • 2,227
  • 4
  • 32
  • 53

1 Answers1

1

Here's one way using sample

In [436]: aid_count = df.AID.value_counts()

In [437]: aid_count
Out[437]:
00016A3E    15
00016AFF    12
00016AE9     3
Name: AID, dtype: int64

In [440]: aid_count[aid_count > 10]
Out[440]:
00016A3E    15
00016AFF    12
Name: AID, dtype: int64

Get df1 based on aid_count

In [438]: df1 = df[df.AID.isin(aid_count[aid_count > 10].index)]

3 rows are filtered out from df to be df1

In [441]: df.shape
Out[441]: (30, 3)

In [442]: df1.shape
Out[442]: (27, 3)

Now, for sampling

Take n=2 samples on AID groups by applying the lambda function.

In [439]: df1.groupby('AID', as_index=False).apply(lambda x: x.sample(n=2))
Out[439]:
           AID       VID  Freq
0 1   00016A3E  0C05DA5D     2
  9   00016A3E  06613535     1
1 19  00016AFF  0AEFE12F     5
  22  00016AFF  43F9E08E     2
Zero
  • 74,117
  • 18
  • 147
  • 154
  • This works great. Only I need the remaining "non-chosen" via the sample rows to be in a dataframe. I'm dividing the original dataframe into two one I use for training and the other for testing. – BKS Feb 12 '17 at 15:10
  • I have used it before, but in this instance I need it to randomly select from a subgroup, which is the AIDs. What you did is exactly what I want, I just want the non-chosen sample to also be there. I want `df3 = df - df2` – BKS Feb 12 '17 at 15:19