I'm trying to plot the average number of clicks on a given day over a one-week period. The data I'm working with can be found here if you want to take a look, but briefly, it is structured like so:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400165 entries, 0 to 400164
Data columns (total 9 columns):
uuid 400165 non-null object
timestamp 400165 non-null object
session_id 400165 non-null object
group 400165 non-null object
action 400165 non-null object
checkin 223824 non-null float64
page_id 400165 non-null object
n_results 136234 non-null float64
result_position 230482 non-null float64
dtypes: float64(3), object(6)
memory usage: 27.5+ MB
And the output of df.head() looks like:
uuid timestamp session_id group action checkin page_id n_results result_position
0 00000736167c507e8ec225bd9e71f9e5 20160301103842 78245c2c3fba013a b searchResultPage NaN cbeb66d1bc1f1bc2 5.0 NaN
1 00000c69fe345268935463abbfa5d5b3 20160307005226 c559c3be98dca8a4 a searchResultPage NaN eb658e8722aad674 10.0 NaN
2 00003bfdab715ee59077a3670331b787 20160302145305 760bf89817ce4b08 a checkin 30.0 f99a9fc1f7fdd21e NaN NaN
3 0000465cd7c35ad2bdeafec953e08c1a 20160302222701 fb905603d31b2071 a checkin 60.0 e5626962a6939a75 NaN 10.0
4 000050cbb4ef5b42b16c4d2cf69e6358 20160302095023 c2bf5e5172a892dc a checkin 30.0 787dd6a4c371cbf9 NaN NaN
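Note that date doesn't appear in the columns above; I derived it from the timestamp column, roughly like this (the exact format string here is from memory, so treat it as approximate):

import pandas as pd
# parse the 14-digit timestamp string and keep only the calendar date
df['date'] = pd.to_datetime(df['timestamp'], format='%Y%m%d%H%M%S').dt.date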
There are 8 unique values of df['date'], and I want to make a new variable that equals (the count of all values of session_id on a given date) divided by (the count of unique values of session_id on that same date).
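To show the shape of what I mean, something like this per-date summary is what I imagine (untested, just sketching the building blocks):

# rows per date divided by distinct sessions per date
per_date = df.groupby('date')['session_id'].agg(['size', 'nunique'])
ctr = per_date['size'] / per_date['nunique']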
I'd love to do this in one step, but I think I've managed to get part of the way there with this code:
df['sessions_bydate'] = df.groupby('date')['session_id'].nunique().sum()
pd.crosstab(df['date'], df['sessions_bydate'])
# Which produces this output:
sessions_bydate 68110
date
2016-03-01 59324
2016-03-02 60322
2016-03-03 60643
2016-03-04 45745
2016-03-05 36242
2016-03-06 41199
2016-03-07 52557
2016-03-08 44129
Interestingly, though, this looks like the count of all values of session_id by date, not the count of unique values, even though I'm using nunique(). I'm not sure what I'm doing wrong there.
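My best guess at what's happening, from poking at the intermediate steps (I may well be misreading it):

# this gives one distinct-session count per date, as expected...
df.groupby('date')['session_id'].nunique()
# ...but chaining .sum() collapses those counts into a single scalar (68110),
# which the assignment then broadcasts to every row of df,
# so the crosstab just ends up counting rows per date
df.groupby('date')['session_id'].nunique().sum()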
Ultimately, my objective is to produce a line plot with lowess estimation that shows the average number of clicks per session per date. If I'm approaching this in a strange way, let me know; I'm very open to alternatives.
If helpful, in R, what I'd want to do is:
# Data transformation
CTR <- df %>%
group_by(date) %>%
summarise("click_through_rate" = n()/
n_distinct(session_id))
# Graphical representation
ggplot(CTR, aes(x = date, y = click_through_rate)) +
geom_step() +
ylim(c(0,NA)) +
geom_smooth() +
labs(title = "Daily click-through rate")
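And here is my rough, untested attempt at translating that into pandas/matplotlib; the lowess call via statsmodels is an assumption on my part, standing in for geom_smooth():

import matplotlib.pyplot as plt
import statsmodels.api as sm

# n() / n_distinct(session_id) per date, as in the R summarise()
grouped = df.groupby('date')['session_id']
ctr = grouped.size() / grouped.nunique()

fig, ax = plt.subplots()
ax.step(range(len(ctr)), ctr.values, where='mid', label='daily CTR')

# lowess smoother over the day index
smoothed = sm.nonparametric.lowess(ctr.values, range(len(ctr)))
ax.plot(smoothed[:, 0], smoothed[:, 1], label='lowess')

ax.set_xticks(range(len(ctr)))
ax.set_xticklabels(ctr.index, rotation=45)
ax.set_ylim(bottom=0)
ax.set_title('Daily click-through rate')
ax.legend()
plt.show()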
EDIT:
At the suggestion of one of the comments, I include here a small snippet of data that should have the right structure needed to replicate my problem (formatted the way df1.head(18).to_dict() would print it). I've manually made this snippet in Excel, and the variables all_values and unique_values are not in my actual data. I think I want to create them in order to compute what I actually want: all_values / unique_values. But if there's a way to skip straight to the finish line, I'm all ears:
df = pd.DataFrame({
'all_values': {0: 3, 1: 3, 2: 3, 3: 4, 4: 4, 5: 4, 6: 4, 7: 3, 8: 3, 9: 3, 10: 3, 11: 3, 12: 3, 13: 5, 14: 5, 15: 5, 16: 5, 17: 5},
'sessionid': {0: '000936ae06d62383', 1: '000936ae06d62383', 2: '0024f4f005f34c9d', 3: '002601319d1a02e1', 4: '0029420a5f8c7d90', 5: '002601319d1a02e1', 6: '002601319d1a02e1', 7: '002ad4609eedc350', 8: '002ad4609eedc350', 9: '002ad4609eedc350', 10: '002b97995ca9ce77', 11: '002b97995ca9ce77', 12: '002ad4609eedc350', 13: '003f93d4791463c9', 14: '003f93d4791463c9', 15: '00429c6a475ab250', 16: '002b97995ca9ce77', 17: '00429c6a475ab250'},
'unique_values': {0: 2, 1: 2, 2: 2, 3: 2, 4: 2, 5: 2, 6: 2, 7: 1, 8: 1, 9: 1, 10: 2, 11: 2, 12: 2, 13: 3, 14: 3, 15: 3, 16: 3, 17: 3},
'date': {0: '02/01/2016', 1: '02/01/2016', 2: '02/01/2016', 3: '03/01/2016', 4: '03/01/2016', 5: '03/01/2016', 6: '03/01/2016', 7: '04/01/2016', 8: '04/01/2016', 9: '04/01/2016', 10: '05/01/2016', 11: '05/01/2016', 12: '05/01/2016', 13: '06/01/2016', 14: '06/01/2016', 15: '06/01/2016', 16: '06/01/2016', 17: '06/01/2016'}
})
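And this is my guess at how those two helper columns (and the ratio I'm after) could be derived from date and sessionid alone, using transform so the counts align row-wise (untested against the real data):

g = df.groupby('date')['sessionid']
df['all_values'] = g.transform('size')        # count of all rows on that date
df['unique_values'] = g.transform('nunique')  # distinct sessions on that date
df['ratio'] = df['all_values'] / df['unique_values']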