
Suppose I have this dataframe, df, in R:

UserID <- c(1, 1, 1, 5, 5, 7, 7, 9, 9, 9)
PathID <- c(1,2,3,1,2,1,2,1,2,3)
Page <- c("home", "about", "services", "home", "pricing", 
"pricing", "home", "about", "home", "services")
df <- data.frame(UserID, PathID, Page)

I am trying to write code that returns the full sequence (along with UserID and PathID) for each user who visits the 'home' page but does not visit the 'about' page subsequently. My output should look like this:

UserID <- c(5, 5, 7, 7, 9, 9, 9)
PathID <- c(1,2,1,2,1,2,3)
Page <- c("home", "pricing", "pricing", "home", "about", "home", "services")
df1 <- data.frame(UserID, PathID, Page)

I would really appreciate some help here.

Maël
user2845095
  • The code for your desired result does not work: `Error in data.frame(UserID, PathID, Page): arguments imply differing number of rows: 7, 10` – VvdL Sep 28 '22 at 15:17

2 Answers


With a grouped filter, you can remove an entire user's group (using !any) if it contains "home" immediately followed by "about".

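library(dplyr)

df %>%
  group_by(UserID) %>%
  # Drop the whole user if any "about" row directly follows a "home" row.
  # The default fills the first lag() value of each group so it is never "home".
  filter(!any(Page == "about" & lag(Page, default = "nothome") == "home"))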
  UserID PathID Page     
1      5      1 home    
2      5      2 pricing 
3      7      1 pricing 
4      7      2 home    
5      9      1 about   
6      9      2 home    
7      9      3 services
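
For readers who want to see the intermediate steps, here is a minimal base R sketch of the same idea. It is not part of the original answer; the helper names `prev_page`, `home_then_about`, `bad_users` and `result` are made up for illustration, and it assumes the rows are already ordered by PathID within each user, as in the question.

```r
# Rebuild the example data from the question.
df <- data.frame(
  UserID = c(1, 1, 1, 5, 5, 7, 7, 9, 9, 9),
  PathID = c(1, 2, 3, 1, 2, 1, 2, 1, 2, 3),
  Page   = c("home", "about", "services", "home", "pricing",
             "pricing", "home", "about", "home", "services"),
  stringsAsFactors = FALSE
)

# Previous page within each user; the first row of each user gets NA.
prev_page <- ave(df$Page, df$UserID, FUN = function(p) c(NA, head(p, -1)))

# TRUE on rows where "about" directly follows "home" for the same user.
home_then_about <- df$Page == "about" & !is.na(prev_page) & prev_page == "home"

# Users with at least one such row are dropped entirely.
bad_users <- unique(df$UserID[home_then_about])
result <- df[!df$UserID %in% bad_users, ]
```

Here only user 1 has "about" directly after "home", so `result` keeps users 5, 7 and 9. User 9 is kept because their "about" visit comes before "home", not after it.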
Maël

An option with data.table

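library(data.table)

# The inner call returns the row indices (.I) of users to keep;
# the outer call subsets df by those indices.
setDT(df)[df[, .I[!any(Page == "about" & shift(Page) == "home",
                       na.rm = TRUE)], by = UserID]$V1]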
   UserID PathID     Page
    <num>  <num>   <char>
1:      5      1     home
2:      5      2  pricing
3:      7      1  pricing
4:      7      2     home
5:      9      1    about
6:      9      2     home
7:      9      3 services
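
The one-liner above can be read more easily when split into two steps. The following is only a sketch of that decomposition, assuming the data.table package is installed; the names `dt`, `idx` and `result` are illustrative and not from the original answer.

```r
library(data.table)

# Rebuild the example data from the question as a data.table.
dt <- data.table(
  UserID = c(1, 1, 1, 5, 5, 7, 7, 9, 9, 9),
  PathID = c(1, 2, 3, 1, 2, 1, 2, 1, 2, 3),
  Page   = c("home", "about", "services", "home", "pricing",
             "pricing", "home", "about", "home", "services")
)

# Step 1: per user, return that user's row numbers (.I) only when no
# "about" row directly follows a "home" row. shift() gives the previous
# page (NA for the first row), and na.rm = TRUE ignores those NAs.
idx <- dt[, .I[!any(Page == "about" & shift(Page) == "home", na.rm = TRUE)],
          by = UserID]$V1

# Step 2: subset the table by the collected row numbers.
result <- dt[idx]
```

When a user fails the condition, `.I[FALSE]` is empty, so that user contributes no row numbers and disappears from `result`.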
akrun