I have some clickstream data I'd like to attribution analyze in a particular way, but I need to get into a specific format for users that convert and those that don't.
Reprex data:
df <- structure(list(User_ID = c(2001, 2001, 2001, 2002, 2001, 2002,
2001, 2002, 2002, 2003, 2003, 2001, 2002, 2002, 2001), Session_ID = c("1001",
"1002", "1003", "1004", "1005", "1006", "1007", "Not Set", "Not Set",
"Not Set", "Not Set", "Not Set", "1008", "1009", "Not Set"),
Date_time = structure(c(1540103940, 1540104060, 1540104240,
1540318080, 1540318680, 1540318859, 1540314360, 1540413060,
1540413240, 1540538460, 1540538640, 1540629660, 1540755060,
1540755240, 1540803000), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
Source = c("Facebook", "Facebook", "Facebook", "Google",
"Email", "Google", "Email", "Referral", "Referral", "Facebook",
"Facebook", "Google", "Referral", "Direct", "Direct"), Conversion = c(0,
0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -15L), spec = structure(list(
cols = list(User_ID = structure(list(), class = c("collector_double",
"collector")), Session_ID = structure(list(), class = c("collector_character",
"collector")), Date_time = structure(list(format = ""), class = c("collector_datetime",
"collector")), Source = structure(list(), class = c("collector_character",
"collector")), Conversion = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
Then set classes:
df <- df %>%
mutate(User_ID = as.factor(User_ID),
Session_ID = as.factor(Session_ID),
Date_time = as.POSIXct(Date_time)
)
I'd like to get all user visit paths to purchase, or total paths for ones that do not lead to purchase.
The format for the new column path
would be for example: Facebook > Facebook > Facebook > Email > Email
for user 2001 which I know how to achieve using
mutate(path = paste0(source, collapse = " > "))
The complications are:
- The majority of session IDs are not set, meaning they're missing
- Some users may convert more than once
- Some users may convert and return but not convert
Each row would be either:
- A conversion by user ID - most converted users only convert once, but
some may convert multiple times in which case there would be a row
per conversion. The
path
column would reflect the journey to conversion - for a user's second or subsequent conversion only the path subsequent to the previous conversion would be shown. - Or a non-converted user-journey with their total path in the above format
For the above reprex the result would look like below:
# A tibble: 5 x 5
User_ID Session_ID Date_time Conversion Path
<dbl> <chr> <dttm> <dbl> <chr>
1 2001 1007 2018-10-23 17:06:00 1 Facebook > Facebook > Facebook > Email > Email
2 2002 Not Set 2018-10-24 20:34:00 1 Google > Google > Referral > Referral
3 2003 Not Set 2018-10-26 07:24:00 0 Facebook > Facebook
4 2002 1009 2018-10-28 19:34:00 0 Referral > Direct
5 2001 Not Set 2018-10-29 08:50:00 1 Google > Direct
... where:
- user 2001 converted twice and the paths are represented separately;
- user 2002 converted and then came back later but did not convert and so the converted and non converted paths are represented as separate rows.
- User 2003 never converted and so this path is represented.