Get n users from pandas dataframe by id

Question

This is a mock dataframe.

import pandas as pd

df_test = pd.DataFrame({
  'ID': [8972685, 8972685, 8972685, 8972685, 8972685, 8972685, 9834561, 9834561, 9834561, 9834561, 9834561, 9834561],
  'POST': ['texteghteh', 'tethrtxt', 'tetrhrtxt', 'terthtrxt', 'teetrwxt', 'twetrhext', 'tethdxt', 'texthdt', 'texdhtrt', 'texdthdt', 'tdghgdhtext', 'tthtdext']
})

The real dataframe contains approximately 90,000 distinct users and 28,000,000 rows. Each row contains a post made by some user. I want to pick n users from the dataframe along with their posts. For example, if I pick the first 500 users and each has 1000 posts, I need to obtain 500,000 rows.

I previously asked this and it was instantly marked as a duplicate, which I don't think it is. There is another answer, but I did not manage to apply those solutions successfully. I need it the other way round: the first n groups, regardless of how many entries each has.

I tried this:

df_test.groupby('ID')['POST'].head(2)

which yields:

0    texteghteh
1      tethrtxt
6       tethdxt
7       texthdt
Name: POST, dtype: object

This gives me the first two posts from each user. What I want instead is the first 2 users with all of their posts.

"I did not manage to apply those solutions successfully" please provide a [mcve] showing code for what you tried, and describe in detail what was wrong with your output. Without the detail of what went wrong, the task itself is definitely a duplicate of the listed question — G. Anderson, Nov 17 '20 at 17:05

score 1 · Accepted Answer · answered Nov 17 '20 at 17:07

It depends on how you want to sample the users and their posts. For example, if you want to get the first 500 users with at least 1000 posts:

n_users, min_posts = 500, 1000
groups = df_test.groupby('ID')
sizes = groups.size()

# get the first n_users with at least min_posts
users = sizes[sizes>=min_posts].head(n_users).index

Now, if you don't want to get the first users, but rather sample them randomly, you can do:

users = sizes[sizes>=min_posts].sample(n_users).index

Once you have the users, you can filter with isin:

df_test[df_test['ID'].isin(users)]
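To see the steps above run end to end, here is a minimal sketch on a small made-up dataframe (the IDs, posts, and the name df_demo are illustrative, not from the question). One detail worth knowing: groupby sorts the group keys by default, so "first" users means the smallest IDs. Passing sort=False, as assumed here, keeps users in order of first appearance instead.

```python
import pandas as pd

# Hypothetical demo data: four users with 3, 1, 4 and 3 posts respectively.
df_demo = pd.DataFrame({
    'ID': [9834561] * 3 + [1234567] * 1 + [8972685] * 4 + [5555555] * 3,
    'POST': [f'post_{i}' for i in range(11)],
})

n_users, min_posts = 2, 2

# sort=False keeps the users in order of first appearance in the dataframe.
sizes = df_demo.groupby('ID', sort=False).size()

# First n_users among those with at least min_posts posts.
users = sizes[sizes >= min_posts].head(n_users).index

# All rows belonging to the selected users.
result = df_demo[df_demo['ID'].isin(users)]
print(result)
```

Setting min_posts to 1 keeps every user, which gives the first n users regardless of how many posts each has.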

You can apply the same logic with either groupby().head() or groupby().sample() to limit the number of posts per user. For example, to randomly sample min_posts posts for each of these users:

df_test[df_test['ID'].isin(users)].groupby('ID').sample(min_posts)
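As a rough sketch of both variants side by side, the helper below (pick_posts is a hypothetical name, as are its parameters and the demo data) takes either the first posts_per_user rows of each selected user or a random sample of that size. Note that groupby().sample() requires pandas 1.1 or later and raises a ValueError when a group has fewer rows than requested, which is why the users were filtered by size first.

```python
import pandas as pd

def pick_posts(df, users, posts_per_user, random=False, seed=0):
    """Return posts_per_user rows for each user in users (hypothetical helper)."""
    subset = df[df['ID'].isin(users)]
    grouped = subset.groupby('ID', sort=False)
    if random:
        # random_state makes the sample reproducible between runs.
        return grouped.sample(n=posts_per_user, random_state=seed)
    # head() keeps the first rows of each group in their original order.
    return grouped.head(posts_per_user)

# Hypothetical demo data: user 1 has 4 posts, user 2 has 3 posts.
df_demo = pd.DataFrame({
    'ID': [1, 1, 1, 1, 2, 2, 2],
    'POST': ['a0', 'a1', 'a2', 'a3', 'b0', 'b1', 'b2'],
})

first_posts = pick_posts(df_demo, users=[1, 2], posts_per_user=2)
sampled_posts = pick_posts(df_demo, users=[1, 2], posts_per_user=2, random=True)
print(first_posts)
print(sampled_posts)
```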

This is exactly what I needed. Thanks a lot! — Andreea-Codrina Moldovan, Nov 17 '20 at 17:31
