Finding a missing value in sequential data

Question

Suppose I have this dataframe, df, in R:

UserID <- c(1, 1, 1, 5, 5, 7, 7, 9, 9, 9)
PathID <- c(1,2,3,1,2,1,2,1,2,3)
Page <- c("home", "about", "services", "home", "pricing", 
"pricing", "home", "about", "home", "services")
df <- data.frame(UserID, PathID, Page)

I am trying to write code that returns the sequences (along with UserID and PathID) in which the user visits the 'home' page but does not subsequently visit the 'about' page. My output should look like this:

UserID <- c(5, 5, 7, 7, 9, 9, 9)
PathID <- c(1,2,1,2,1,2,3)
Page <- c("home", "pricing", "pricing", "home", "about", "home", "services")
df1 <- data.frame(UserID, PathID, Page)

I would really appreciate some help here.

The code for your desired result does not work: `Error in data.frame(UserID, PathID, Page): arguments imply differing number of rows: 7, 10` — VvdL, Sep 28 '22 at 15:17

score 1 · Answer 1 · answered Sep 28 '22 at 15:17

With a couple of filtering conditions, you can remove an entire group (using !any) if it contains "home" immediately followed by "about".

library(dplyr)
df %>% 
  group_by(UserID) %>% 
  filter(!any(Page == "about" & lag(Page, default = "nothome") == "home"))

  UserID PathID Page     
1      5      1 home    
2      5      2 pricing 
3      7      1 pricing 
4      7      2 home    
5      9      1 about   
6      9      2 home    
7      9      3 services
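
To see what the filter is doing, the same logic can be spelled out step by step in base R. This is only a sketch of the idea, not part of the answer above: the helper vectors prev_page, home_then_about and drop_user are names made up for illustration, and it assumes the rows are already ordered by PathID within each UserID.

```r
df <- data.frame(
  UserID = c(1, 1, 1, 5, 5, 7, 7, 9, 9, 9),
  PathID = c(1, 2, 3, 1, 2, 1, 2, 1, 2, 3),
  Page   = c("home", "about", "services", "home", "pricing",
             "pricing", "home", "about", "home", "services")
)

# Previous page within each user; the first row of each user gets NA.
prev_page <- ave(df$Page, df$UserID, FUN = function(p) c(NA, head(p, -1)))

# TRUE only on rows where "about" directly follows "home" for the same user.
home_then_about <- df$Page == "about" & !is.na(prev_page) & prev_page == "home"

# Mark every row of a user who has at least one such transition.
drop_user <- ave(home_then_about, df$UserID, FUN = any)

# Keep only the users without a home -> about step.
result <- df[!drop_user, ]
result
```

Only user 1 has "about" directly after "home", so that user's three rows are dropped and users 5, 7 and 9 remain.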

score 0 · Answer 2 · answered Sep 28 '22 at 15:26

An option with data.table:

library(data.table)
setDT(df)[df[, .I[!any(Page == "about" & shift(Page) == "home",
                       na.rm = TRUE)], by = UserID]$V1]
   UserID PathID     Page
    <num>  <num>   <char>
1:      5      1     home
2:      5      2  pricing
3:      7      1  pricing
4:      7      2     home
5:      9      1    about
6:      9      2     home
7:      9      3 services
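
The na.rm = TRUE matters because shift(), like lag() without a default, puts NA in the first row of each group. Here is a small base R sketch of that effect, using made-up vectors page and prev that mimic user 9 (prev is built by hand here rather than with shift()):

```r
# Pages visited by user 9, in order.
page <- c("about", "home", "services")

# Previous page; there is nothing before the first visit, so it is NA.
prev <- c(NA, head(page, -1))

# "about" directly preceded by "home"?
hit <- page == "about" & prev == "home"
hit                     # NA FALSE FALSE  (TRUE & NA is NA)

any(hit)                # NA
any(hit, na.rm = TRUE)  # FALSE
```

Without na.rm = TRUE, !any(...) would be NA for user 9, and subsetting .I with NA would yield NA row indices instead of that user's rows. The dplyr answer avoids the same problem differently, by passing default = "nothome" to lag().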
