I am working with weblogs and have data containing account_id and session_id. Multiple sessions can be associated with one account. I want to create a new dataframe that contains each account_id together with the number of unique sessions associated with that account. My df looks like this:

account_id session_id
 1111          de322
 1111          de322
 1111          de322
 1111          de323
 1111          de323
 0210          ge012
 0210          ge013
 0211          ge330
 0213          ge333

I'm using this code:

new_df = df.groupby(['account_id','session_id']).sum()

The output I am getting is below:

 account_id     sessions
 1111           de322
                de323
 0210           ge012 
                ge013 
 0211           ge330
 0213           ge333

The output I'm expecting:

account_id   sessions
 1111           2
 0210           2  
 0211           1
 0213           1

How should I fix it?

Tadas Melnikas
1 Answer

import pandas as pd

df = pd.DataFrame({'session': ['de322', 'de322', 'de322', 'de323', 'de323', 'ge012', 'ge012', 'ge013', 'ge333'],
                   'user_id': [1111, 1111, 1111, 1111, 1111, 210, 210, 210, 211],
                   })
print(df)

# Drop repeated (session, user_id) rows, then count the sessions left for each user
counts = df.drop_duplicates().groupby('user_id').count()
print(counts)

Output of the second print:

         session
user_id
210            2
211            1
1111           2
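
For reference, the same idea can be written against the column names from the question. This is only a sketch: it assumes account_id is stored as a string (so the leading zero in 0210 survives), and it swaps drop_duplicates().count() for nunique(), which counts distinct values per group directly. The new_df name and the sessions column name are taken from the question's expected output.

```python
import pandas as pd

# Sample data copied from the question; account_id is kept as a string
# (an assumption) so that leading zeros such as "0210" are preserved.
df = pd.DataFrame({
    "account_id": ["1111"] * 5 + ["0210", "0210", "0211", "0213"],
    "session_id": ["de322", "de322", "de322", "de323", "de323",
                   "ge012", "ge013", "ge330", "ge333"],
})

# Group by account only, then count the distinct session_id values in each group.
# sort=False keeps the accounts in order of first appearance.
new_df = (
    df.groupby("account_id", sort=False)["session_id"]
      .nunique()
      .reset_index(name="sessions")
)
print(new_df)
```

Grouping by both account_id and session_id, as in the question, makes every (account, session) pair its own group, which is why the original attempt listed the sessions instead of counting them.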
Nihal