I have a data frame with a daily value for several users. The users have different start dates, so I have assigned NA for the values before the first use and zero-values for any cells without values thereafter. I have used the following loop to do this:

for (i in seq_along(df)) {
 isna <- is.na(df[[i]])
 nonna <- match(FALSE,isna)
 id <- which(isna)
 df[[i]][id[id>nonna]] <- 0
}

However, some of the users have a lot of zero-values towards the end, indicating that they have stopped using the service. I would also like to set these values to NA if a user has more than 100 zero-values at the end of their column. I have not succeeded in doing this, and any suggestions would be appreciated.

Ase
    It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Jul 17 '20 at 19:55

1 Answer

I think I understand your problem, so let me restate it so you can tell me if I am wrong.

You have a data frame where columns represent users, and rows represent days. Taking a column out of your data frame with df[[i]] will therefore give you a time series for one user's activity.

The users didn't all start on the same day, so some of these time series may have a long initial run of 0 activity. This indicates that the user was not yet with your service, and should be NA instead of 0. We can therefore assume everything prior to the date of the first non-zero number should be NA.

Some users have 0 activity on some days after joining your service. This just means they aren't using your service on that day. However, if they leave your service altogether, they will generate a long run of zeros up to the end of their column from the point at which they left.

Some users might have a few 0s at the end of the data frame by chance - they have not left the service, but just happen not to have used it for a few days at the time point when the data frame stops. These 0s should not be converted to NA values. However, if the user has more than 100 consecutive days of zero activity ongoing by the end of their column, all the zeros at the end should be converted to NA.

Assuming this is what you mean, and assuming there are no NA values to start with in your columns, we can solve the problem with run length encoding. I have commented each line so you can follow the logic:

for(i in seq_along(df))         # Loop over every column, not just the last one
{
  user <- df[[i]]               # Write the column to a new vector for clarity
  
  MAX      <- 100               # Set the maximum number of 0s allowed at the end
  user_rle <- rle(user)         # Get run length encoding of the column
  lens     <- user_rle$lengths  # Extract the run-length encoding lengths
  vals     <- user_rle$values   # Extract the run-length encoding values
  last     <- length(lens)      # For clarity of code, make alias for last index of rle
  
  if(vals[1] == 0) {                # If zeros at the start...
    user[seq_len(lens[1])] <- NA    # Replace with NA
  }
  
  if(vals[last] == 0 && lens[last] > MAX) {          # If more than 100 0s at end
    user[(-lens[last] + 1):0 + length(user)] <- NA   # Replace with NA
  }
  
  df[[i]] <- user               # Write the vector back in to the data frame
}
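
If run length encoding is unfamiliar, it may help to see what rle() returns on a small vector. The toy vector below is made up purely for illustration and is not taken from your data:

```r
# Toy activity vector (illustrative only): 2 leading zeros, some use, 3 trailing zeros
x <- c(0, 0, 5, 0, 3, 3, 0, 0, 0)

r <- rle(x)
r$lengths   # 2 1 1 2 3
r$values    # 0 5 0 3 0

# Positions covered by the final run, i.e. the trailing zeros
n_last   <- r$lengths[length(r$lengths)]
tail_idx <- seq(to = length(x), length.out = n_last)
tail_idx    # 7 8 9
```

The first and last entries of r$values tell you whether the column starts or ends with zeros, and the matching entries of r$lengths tell you how long those runs are.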

Note that there are more efficient ways to do this using less code, but this is intended to be easy to follow.
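
As a rough sketch of one such shorter approach, you could locate the first and last days with non-zero activity and blank out everything outside that window. The helper name trim_inactive, its max_zeros argument and the toy data frame are all made up for this example. Unlike the loop above, it also treats NA values that are already in the column as inactive days, which is an assumption about what you want:

```r
# Hypothetical helper: blank out the pre-start period and a long inactive tail
trim_inactive <- function(user, max_zeros = 100) {
  active <- which(!is.na(user) & user != 0)   # days with non-zero activity
  if (length(active) == 0) {                  # never active: whole column becomes NA
    user[] <- NA
    return(user)
  }
  first <- active[1]
  last  <- active[length(active)]
  n     <- length(user)
  if (first > 1) user[seq_len(first - 1)] <- NA        # everything before first use
  if (n - last > max_zeros) user[(last + 1):n] <- NA   # trailing run longer than the limit
  user
}

# Toy data (illustrative only), with a limit of 5 instead of 100
df <- data.frame(
  a = c(0, 0, 1, 0, 2, 0, 0, 0),   # 3 trailing zeros: kept
  b = c(0, 4, 0, 0, 0, 0, 0, 0)    # 6 trailing zeros: set to NA
)
df[] <- lapply(df, trim_inactive, max_zeros = 5)
```

Assigning to df[] rather than df keeps the result as a data frame with the original column names.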

Allan Cameron