
I have a data frame where the rows represent a transaction done by a certain user. Note that more than one row can have the same user_id. Given the column names gender and user_id running:

df.gender.value_counts()

returns the frequencies, but they are inflated, since the same user may be counted more than once. For example, it may report 50 male individuals when there are actually far fewer.

Is there a way I can condition value_counts() to count only once per user_id?
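A toy frame (hypothetical column values) makes the double counting concrete: one male user with two transactions is counted twice.

```python
import pandas as pd

# hypothetical transactions: user u1 appears twice
df = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2'],
    'gender':  ['male', 'male', 'female'],
})

# counts rows, not users: 'male' comes out as 2 even though
# there is only one male user
print(df.gender.value_counts())
```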

Kevin Zakka
  • Possible duplicate of [Count unique values with pandas](http://stackoverflow.com/questions/38309729/count-unique-values-with-pandas) – ayhan Jul 12 '16 at 11:56
  • I wonder why you don't select unique `user_id` and group by `gender` afterwards. Hopefully, your users don't change their gender too often.. – jbndlr Jul 12 '16 at 14:42

2 Answers


You want to use pandas' groupby on your DataFrame:

import random
import pandas as pd

users = {'A': 'male', 'B': 'female', 'C': 'female'}
# one row per transaction: draw a random user id 50 times (with repetition)
ul = [{'id': k, 'gender': users[k]} for k in random.choices(list(users), k=50)]
df = pd.DataFrame(ul)

print(df.groupby('gender')['id'].nunique())

This yields (depending on the random draw, but with 50 samples it is very likely that each of the three ids appears at least once):

gender
female    2
male      1
Name: id, dtype: int64
jbndlr

I agree with the first answer, but just to make the groupby simpler:

df.groupby('user_id').first().count() will give you, for each column, the number of unique users with a non-null value

or alternatively:

pd.value_counts(df.groupby('user_id').first().reset_index().gender)
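An equivalent route (not from either answer, assuming gender is constant per user_id) is to drop duplicate users first and then count as usual:

```python
import pandas as pd

# hypothetical data: u1 has two transactions
df = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2', 'u3'],
    'gender':  ['male', 'male', 'female', 'female'],
})

# keep one row per user, then value_counts counts each user once
print(df.drop_duplicates('user_id').gender.value_counts())
# female    2
# male      1
```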
A.Kot