Removing a set of adjacent rows of a data frame meeting a specific pattern - R

Question

I posted this question on 12/19. I received one response that was very helpful but not quite what I was looking for. Then the question was closed by three folks, who specified that it needed more focus. The instructions indicated I could update the question or post a new one, but after I edited it to make it more focused it remained closed. So, I am posting it again.

Here is the link to the edited question, including a more concise dataset (which had been one critical comment): Identifying a specific pattern in several adjacent rows of a single column - R

But, in case that link isn't allowed, here's the content:

I need to remove a specific set of rows from data when they occur. In our survey, an automated telephone survey, the survey tool will attempt three times during that call to prompt the respondent to enter a response. After three timeouts of the question the survey tool hangs up. This mostly happens when the call goes to someone's voicemail.

I would like to identify that pattern when it happens so I can exclude those rows when calculating call time.

The pattern I am looking for looks like this in the Interactions column:

It doesn't HAVE to be Intro. It can be any part of the survey where it is prompting the respondent for a response THREE times but no response is provided, so the call fails. But it does have to be sandwiched between "Answer" (the phone picks up) and "Timeout. Call failed." (a failure).
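As a rough illustration of the pattern described above, the sketch below encodes one failed call as a character vector and checks whether an "Answer" row is followed by three prompt/timeout pairs and then "Timeout. Call failed.". The toy values, the exact row layout, and the helper name `is_triple_timeout_call` are assumptions inferred from the description, not taken from the real data.

```r
# Toy interaction sequence for one call (assumed layout, not real data).
interaction <- c("Answer",
                 "Intro", "Timeout",
                 "Intro", "Timeout",
                 "Intro", "Timeout",
                 "Timeout. Call failed.")

# Hypothetical helper: TRUE if position `i` holds "Answer" and the next seven
# rows are prompt/"Timeout" three times, then "Timeout. Call failed.".
# The prompt rows themselves (positions 2, 4, 6 of the window) may be anything.
is_triple_timeout_call <- function(x, i) {
  if (i + 7 > length(x)) return(FALSE)   # not enough rows left for the pattern
  window <- x[i:(i + 7)]
  isTRUE(
    window[1] == "Answer" &&
      all(window[c(3, 5, 7)] == "Timeout") &&
      window[8] == "Timeout. Call failed."
  )
}

is_triple_timeout_call(interaction, 1)
```

A check like this only looks at one starting row at a time; applying it to a whole data frame would still need a loop or a vectorised equivalent.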

I did try to apply what I learned from yesterday's solution (about run length encoding) to my other indexing question but I couldn't make it work in the slightest. So, here I am.

Here's an example dataset:

This is 4 respondents and every interaction between the survey tool and the respondent (or their phone, essentially).

Here's the code for the dataframe: This goes to a Google Drive text editor with the code

The response I got from Rui Barradas was this:

removeRows <- function(X, col = "Interaction", 
                       ans = "Answer", 
                       fail = c("Timeout. Call failed.", "Partial", "Enqueueing call"))
{  
  a <- grep(ans, X[[col]])
  f <- which(X[[col]] %in% fail)
  a <- a[findInterval(f, a)]

  for(i in seq_along(a)){
    X[[col]][a[i]:f[i]] <- NA_character_
  }
  Y <- X[complete.cases(X), , drop = FALSE]
  Y
}

removeRows(survey_data)

However, this solution is too broad. I need to remove only the rows where three attempts are made to prompt a response but no response is provided. For example, where the prompt is Intro, there's no response, so it times out, and eventually the call fails.

Thanks!

OK, I don't open Google Drive links. Please provide some data that fits with https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example and, if you need to, use the https://github.com/tidyverse/reprex package — Bruno, Jan 06 '20 at 22:29
It does not fit in the character space allowed. This is necessarily long data. It just opens a text editor with the code for the data frame. — JeniFav, Jan 07 '20 at 03:30

score 1 · Accepted Answer · answered Jan 06 '20 at 22:48

I would normally use the dplyr package. I'm sure this method can be modified to use base R if needed, but dplyr has pre-made functions that make it easier. Comments in the code explain what it's doing.

library(dplyr)

# `df` is the survey data frame with an `Interaction` column.
df2 <- df %>%
  # Find any entry where there were three timeouts evenly spaced afterwards and set TRUE.
  # You can add other conditions here if needed (to check even leading values).
  mutate(triple_timeout = ifelse(
    lead(Interaction,n=1) == "Timeout" & 
      lead(Interaction,n=3) == "Timeout" & 
      lead(Interaction,n=5) == "Timeout",
    TRUE,
    FALSE
  )) %>%
  # Lead will have some NA values, so fill those in
  mutate(triple_timeout = ifelse(is.na(triple_timeout),FALSE,triple_timeout)) %>%
  # Every triple timeout has six entries that should be TRUE, but only the first is identified.
  # Use `or` logic and lag statements to set the value to TRUE for the 5 entries after any TRUE
  mutate(triple_timeout = triple_timeout | 
           lag(triple_timeout,n=1) |
           lag(triple_timeout,n=2) |
           lag(triple_timeout,n=3) |
           lag(triple_timeout,n=4) |
           lag(triple_timeout,n=5)
         ) %>%
  # Lag will have some NA values, so fill those in
  mutate(triple_timeout = ifelse(is.na(triple_timeout),FALSE,triple_timeout)) %>%
  # Filter out any rows where triple_timeout is TRUE
  filter(!triple_timeout) %>%
  # Remove the helper column
  select(-triple_timeout)
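Since the answer mentions that the method could be adapted to base R, here is a minimal sketch of the same lead-and-mark idea without dplyr, run on a small toy data frame. The toy values (including "Response" and "Complete") and the helper name `shift_lead` are assumptions for illustration only; the real data may be laid out differently.

```r
# Toy data (assumed values): one failed call followed by one answered call.
df <- data.frame(
  Interaction = c("Answer", "Intro", "Timeout", "Intro", "Timeout",
                  "Intro", "Timeout", "Timeout. Call failed.",
                  "Answer", "Intro", "Response", "Complete"),
  stringsAsFactors = FALSE
)

x <- df$Interaction

# Hypothetical helper: shift a vector forward by k positions and pad the end
# with NA, as a base R stand-in for dplyr::lead().
shift_lead <- function(v, k) c(v[-seq_len(k)], rep(NA, k))[seq_along(v)]

# A triple timeout starts on a row whose 1st, 3rd and 5th following rows are
# all exactly "Timeout". which() silently drops the NA comparisons at the end.
starts <- which(shift_lead(x, 1) == "Timeout" &
                shift_lead(x, 3) == "Timeout" &
                shift_lead(x, 5) == "Timeout")

# Each start marks six rows: the start row plus the five rows after it.
drop <- unique(unlist(lapply(starts, function(s) s + 0:5)))

# Guard against the empty case, because df[-integer(0), ] would drop every row.
df2 <- if (length(drop) > 0) df[-drop, , drop = FALSE] else df

df2$Interaction
```

On this toy input only the three prompt/timeout pairs are dropped; the surrounding "Answer" and "Timeout. Call failed." rows stay, which matches what the dplyr version marks.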

Thanks! I'll give it a shot first thing in the morning and circle back. — JeniFav, Jan 07 '20 at 03:32
This worked great, Adam! And I was able to learn how to tweak it if I want to do something similar for an expanded set of rows. Thanks a whole lot! — JeniFav, Jan 07 '20 at 15:18
Glad it could help. The one thing to be aware of is that this method may use a fair amount of memory (which may matter on really big datasets with limited RAM). But it is generally fast and uses vector operations. — Adam Sampson, Jan 07 '20 at 18:18

score 0 · Answer 2 · answered Jan 08 '20 at 01:37

I'll know for sure in the coming month when I have this kind of data for 5K respondents. But I have decent RAM. Thanks, again!

JeniFav
