
I'm trying to plot the average number of clicks on a given day over a one-week period. The data I'm working with can be found here if you want to take a look. But briefly, it is structured like so:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400165 entries, 0 to 400164
Data columns (total 9 columns):
uuid               400165 non-null object
timestamp          400165 non-null object
session_id         400165 non-null object
group              400165 non-null object
action             400165 non-null object
checkin            223824 non-null float64
page_id            400165 non-null object
n_results          136234 non-null float64
result_position    230482 non-null float64
dtypes: float64(3), object(6)
memory usage: 27.5+ MB

And the output of df.head() looks like:

    uuid    timestamp   session_id  group   action  checkin page_id n_results   result_position
0   00000736167c507e8ec225bd9e71f9e5    20160301103842  78245c2c3fba013a    b   searchResultPage    NaN cbeb66d1bc1f1bc2    5.0 NaN
1   00000c69fe345268935463abbfa5d5b3    20160307005226  c559c3be98dca8a4    a   searchResultPage    NaN eb658e8722aad674    10.0    NaN
2   00003bfdab715ee59077a3670331b787    20160302145305  760bf89817ce4b08    a   checkin 30.0    f99a9fc1f7fdd21e    NaN NaN
3   0000465cd7c35ad2bdeafec953e08c1a    20160302222701  fb905603d31b2071    a   checkin 60.0    e5626962a6939a75    NaN 10.0
4   000050cbb4ef5b42b16c4d2cf69e6358    20160302095023  c2bf5e5172a892dc    a   checkin 30.0    787dd6a4c371cbf9    NaN NaN

There are 8 unique values of df['date'] and I want to make a new variable that equals (the count of all values of session_id on a given date) divided by (the count of unique values of session_id on that same date).

I'd love to do this in one step, but I think I've managed to get part of the way there with this code:

df['sessions_bydate'] = df.groupby('date')['session_id'].nunique().sum()
pd.crosstab(df['date'], df['sessions_bydate'])

Which produces this output:

test2   68110
date    
2016-03-01  59324
2016-03-02  60322
2016-03-03  60643
2016-03-04  45745
2016-03-05  36242
2016-03-06  41199
2016-03-07  52557
2016-03-08  44129

Interestingly, though, this looks like the count of all values of session_id by date, not the count of unique values, even though I'm using nunique(). I'm not sure what I'm doing wrong there.

Ultimately, my objective is to produce a line plot with lowess estimation that shows the average number of clicks per session per date. If I'm approaching this in a strange way, let me know; I'm very open to alternatives.

If helpful, in R, what I'd want to do is:

# Data transformation
CTR <- df %>%
            group_by(date) %>%
            summarise("click_through_rate" = n()/
                        n_distinct(session_id))
# Graphical representation
ggplot(CTR, aes(x = date, y = click_through_rate)) + 
  geom_step() +
  ylim(c(0,NA)) +
  geom_smooth() +
  labs(title = "Daily click-through rate")
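
For reference, a rough pandas translation of that R pipeline might look like the sketch below. It is untested against the full dataset and assumes a parsed 'date' column on df plus statsmodels for the lowess curve:

import matplotlib.pyplot as plt
import pandas as pd
from statsmodels.nonparametric.smoothers_lowess import lowess

# group_by(date) %>% summarise(n() / n_distinct(session_id))
ctr = df.groupby('date')['session_id'].agg(lambda s: s.size / s.nunique())

# step plot plus a lowess smooth, mirroring geom_step() + geom_smooth()
x = pd.factorize(ctr.index)[0]                      # numeric x for lowess
smooth = lowess(ctr.values, x, return_sorted=False)

fig, ax = plt.subplots()
ax.step(ctr.index, ctr.values, where='mid', label='daily CTR')
ax.plot(ctr.index, smooth, label='lowess')
ax.set_ylim(bottom=0)
ax.set_title('Daily click-through rate')
ax.legend()
plt.show()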

EDIT: At the suggestion of one of the comments, I include here a small snippet of data that should have the right structure needed to replicate my problem (obtained using df1.head(18).to_dict()). I've manually made this snippet in Excel, and the variables all_values and unique_values are not in my actual data. I think I want to create them in order to make what I actually want: all_values / unique_values. But if there's a way to skip straight to the finish line, I'm all ears.

df = pd.DataFrame({
    'all values': {0: 3, 1: 3, 2: 3, 3: 4, 4: 4, 5: 4, 6: 4, 7: 3, 8: 3, 9: 3, 10: 3, 11: 3, 12: 3, 13: 5, 14: 5, 15: 5, 16: 5, 17: 5},
    'sessionid': {0: '000936ae06d62383', 1: '000936ae06d62383', 2: '0024f4f005f34c9d', 3: '002601319d1a02e1', 4: '0029420a5f8c7d90', 5: '002601319d1a02e1', 6: '002601319d1a02e1', 7: '002ad4609eedc350', 8: '002ad4609eedc350', 9: '002ad4609eedc350', 10: '002b97995ca9ce77', 11: '002b97995ca9ce77', 12: '002ad4609eedc350', 13: '003f93d4791463c9', 14: '003f93d4791463c9', 15: '00429c6a475ab250', 16: '002b97995ca9ce77', 17: '00429c6a475ab250'},
    'unique_values': {0: 2, 1: 2, 2: 2, 3: 2, 4: 2, 5: 2, 6: 2, 7: 1, 8: 1, 9: 1, 10: 2, 11: 2, 12: 2, 13: 3, 14: 3, 15: 3, 16: 3, 17: 3},
    'date': {0: '02/01/2016', 1: '02/01/2016', 2: '02/01/2016', 3: '03/01/2016', 4: '03/01/2016', 5: '03/01/2016', 6: '03/01/2016', 7: '04/01/2016', 8: '04/01/2016', 9: '04/01/2016', 10: '05/01/2016', 11: '05/01/2016', 12: '05/01/2016', 13: '06/01/2016', 14: '06/01/2016', 15: '06/01/2016', 16: '06/01/2016', 17: '06/01/2016'}
    })
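
For what it's worth, the hand-made all values and unique_values columns in that snippet could (I believe) be rebuilt with groupby().transform(), with the ratio following in one more line. A sketch against the snippet above; clicks_per_session is a name I made up:

# rebuild the helper columns per date, then take the ratio directly
df['all values'] = df.groupby('date')['sessionid'].transform('size')
df['unique_values'] = df.groupby('date')['sessionid'].transform('nunique')
df['clicks_per_session'] = df['all values'] / df['unique_values']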
 
C.Robin
  • Welcome! Yours is a very well structured question. See how to provide pandas df samples here https://stackoverflow.com/q/20109391/6692898 – RichieV Jul 31 '20 at 12:18
  • Thanks! I've just added something using `df1.head(18).to_dict()`. Hope that works – C.Robin Jul 31 '20 at 12:36
  • your sample seems to be already aggregated... what does `unique_values` mean in that df? Can you provide a similar sample but before applying groupby? – RichieV Jul 31 '20 at 12:54
  • `unique_values` and `all values` are variables i want to create. Ultimately I want a variable of `all_values / unique_values`. I've clarified this in the OP – C.Robin Jul 31 '20 at 13:06

2 Answers


Assuming I've understood your statement correctly, it's a simple lambda function in agg():

import numpy as np
import pandas as pd

# df = pd.read_csv("events_log.csv")
df1 = df.copy()
# parse the raw YYYYMMDDHHMMSS timestamps and truncate to whole days
df1 = df1.assign(date=pd.to_datetime(df1.loc[:, "timestamp"].astype(np.int64),
                                     format="%Y%m%d%H%M%S").dt.floor("D"))
# per date: count of all rows divided by count of distinct sessions
dfr = df1.groupby("date").agg({"session_id": lambda s: s.count()/len(s.unique())})
print(dfr.to_string())
print(dfr.to_string())

Output:

            session_id
date                  
2016-03-01    6.406479
2016-03-02    6.546066
2016-03-03    6.480445
2016-03-04    5.542954
2016-03-05    5.377151
2016-03-06    5.388438
2016-03-07    5.506338
2016-03-08    5.452063
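
And to get the line graph asked about in the comments below, a minimal (unbeautified) sketch, assuming matplotlib is installed:

import matplotlib.pyplot as plt

# dfr is indexed by date with a single column of per-session averages,
# so DataFrame.plot() draws the date-vs-rate line directly
ax = dfr.plot(legend=False)
ax.set_ylabel("clicks per session")
ax.set_title("Average clicks per session per day")
plt.show()
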
Rob Raymond
  • Hi Rob, that's exactly what I was after. Thank you! I'd upvote your answer but I don't have enough points to do that (sorry). Do you know how I could then go about turning these data points into a line graph, with date on the xaxis and the numbers you produced on the yaxis? – C.Robin Jul 31 '20 at 13:41
  • I must admit I don't do much plotting... `%matplotlib inline` then `ax = dfr.plot()` in my Jupyter notebook does generate the graph you want. Not beautified! Accept my answer and I'll get more rep points :) – Rob Raymond Jul 31 '20 at 13:48
  • This is great! Thanks a lot Rob. I tried to upvote your answer but I don't have a high enough 'reputation score'. I accepted your answer though :) – C.Robin Aug 01 '20 at 12:27

Working with your original data source, first group by date and session_id:

import gzip
import numpy as np
import pandas as pd

# with gzip.open(pth + 'events_log.csv.gz', 'rb') as fo:
#     df = pd.read_csv(fo)
# df.timestamp = pd.to_datetime(df.timestamp.astype(np.int64), format='%Y%m%d%H%M%S')

# count rows per (date, session_id) pair
by_date_id = (
    df.groupby([df.timestamp.dt.date, 'session_id']).uuid.size()
    .rename('rowct').to_frame()
    )

Output:

                             rowct
timestamp  session_id
2016-03-01 000936ae06d62383      1
           001544bc03fac3e8      2
           001a3950cd4ac6c6      5
           001e2d0e159172d2      3
           0022bba0634595b9      1
...                            ...
2016-03-08 ffdf5d155e9b5c19     20
           ffdfd6a8ffbbee8a      1
           ffe1a14d4565573d     17
           fffa387663722d38      1
           fffac2c4111308d0      1

Now to summarize by date:

# 'sum' -> total rows per date; 'size' -> number of distinct sessions
by_date = by_date_id.groupby('timestamp').agg(['sum', 'size'])
by_date.columns = ['rowct', 'unique_sessions']

Output:

            rowct  unique_sessions
timestamp
2016-03-01  59324             9260
2016-03-02  60322             9215
2016-03-03  60644             9358
2016-03-04  45746             8253
2016-03-05  36242             6740
2016-03-06  41200             7646
2016-03-07  52558             9545
2016-03-08  44129             8094

From here on you can divide by_date.rowct / by_date.unique_sessions.
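
Spelled out, a short sketch continuing from the frames above (assuming matplotlib for the plot):

import matplotlib.pyplot as plt

# rows per unique session per date, then the line plot the question asks for
ctr = by_date.rowct / by_date.unique_sessions
ax = ctr.plot()
ax.set_ylabel('rows per unique session')
ax.set_title('Daily click-through rate')
plt.show()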

RichieV
  • Richie, this is an elegant solution. Thanks a lot! And I appreciate your pointers on how best to share data on SO. – C.Robin Aug 01 '20 at 12:28