
I have a data frame where the rows represent a transaction done by a certain user. Note that more than one row can have the same user_id. Given the column names gender and user_id running:

df.gender.value_counts()

returns the frequencies, but they are inflated, since the same user may be counted more than once. For example, it may report 50 male individuals when there are actually far fewer.

Is there a way I can condition value_counts() to count only once per user_id?
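A toy frame (hypothetical column values) makes the double counting concrete: one male user with two transactions is counted twice.

```python
import pandas as pd

# hypothetical transactions: user u1 appears twice
df = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2'],
    'gender':  ['male', 'male', 'female'],
})

# counts rows, not users: 'male' comes out as 2 even though
# there is only one male user
print(df.gender.value_counts())
```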

Kevin Zakka
  • Possible duplicate of [Count unique values with pandas](http://stackoverflow.com/questions/38309729/count-unique-values-with-pandas) – ayhan Jul 12 '16 at 11:56
  • I wonder why you don't select unique `user_id` and group by `gender` afterwards. Hopefully, your users don't change their gender too often.. – jbndlr Jul 12 '16 at 14:42

2 Answers


You want to use pandas' groupby on your DataFrame:

import random
import pandas as pd

users = {'A': 'male', 'B': 'female', 'C': 'female'}
# one row per transaction: draw a random user id 50 times (with repetition)
ul = [{'id': k, 'gender': users[k]} for k in random.choices(list(users), k=50)]
df = pd.DataFrame(ul)

print(df.groupby('gender')['id'].nunique())

This yields (depending on the random draw, but with 50 samples it is very likely that each of the three ids appears at least once):

gender
female    2
male      1
Name: id, dtype: int64
jbndlr

I agree with the first answer, but just to make the groupby simpler:

df.groupby('user_id').first().count() will give you, for each column, the number of unique users with a non-null value

or alternatively:

pd.value_counts(df.groupby('user_id').first().reset_index().gender)
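An equivalent route (not from either answer, assuming gender is constant per user_id) is to drop duplicate users first and then count as usual:

```python
import pandas as pd

# hypothetical data: u1 has two transactions
df = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2', 'u3'],
    'gender':  ['male', 'male', 'female', 'female'],
})

# keep one row per user, then value_counts counts each user once
print(df.drop_duplicates('user_id').gender.value_counts())
# female    2
# male      1
```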
A.Kot