I'm quite new to R, and while I have done some data wrangling with it, I am completely at a loss as to how to tackle this problem. Searching Google and SO hasn't gotten me anywhere so far. If this is a duplicate, I apologize; please point me to the right solution.

I have a data frame with two columns, id and seq, like so:

set.seed(12)
id <- rep(1:2, 10)
seq <- sample(1:4, 20, replace = TRUE)
df <- data.frame(id,seq)
df <- df[order(df$id),]

    id seq  
 1   1   1
 3   1   4
 5   1   1
 7   1   1
 9   1   1
 11  1   2
 13  1   2
 15  1   2
 17  1   2
 19  1   3
 2   2   4
 4   2   2
 6   2   1
 8   2   3
 10  2   1
 12  2   4
 14  2   2
 16  2   2
 18  2   3
 20  2   1

I need to count the number of unequal elements between equal elements in the seq column, e.g. how many elements lie between one 1 and the next 1, or between one 3 and the next 3. The first instance of an element should be NA because there is no earlier element to count from. If the next element is identical, the result should be 0, as there is no unequal element in between (e.g. 1 followed by 1). The results should be written to a new column, e.g. delay.

One catch is that this process would have to start again once a new id starts in the id column (here: 1 & 2).

This is what I would love to have as output:

     id seq   delay 
 1   1   1     NA
 3   1   4     NA
 5   1   1     1
 7   1   1     0
 9   1   1     0
 11  1   2     NA
 13  1   2     0
 15  1   2     0
 17  1   2     0
 19  1   3     NA
 2   2   4     NA
 4   2   2     NA
 6   2   1     NA
 8   2   3     NA
 10  2   1     1
 12  2   4     4
 14  2   2     4
 16  2   2     0
 18  2   3     4
 20  2   1     4

I really hope someone can help me figure this out and allow me to learn more about this.

Dominik
Vanessa S.

4 Answers


A simple dplyr solution:

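library(dplyr)

df %>%
  mutate(row = 1:n()) %>%                 # row position in the whole data frame (df is sorted by id)
  group_by(id, seq) %>%                   # compare only rows sharing the same id and seq value
  mutate(delay = row - lag(row) - 1) %>%  # rows between this and the previous occurrence
  select(-row)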
# # A tibble: 20 x 3
# # Groups:   id, seq [8]
#       id   seq delay
#    <int> <int> <dbl>
#  1     1     1    NA
#  2     1     4    NA
#  3     1     1     1
#  4     1     1     0
#  5     1     1     0
#  6     1     2    NA
#  7     1     2     0
#  8     1     2     0
#  9     1     2     0
# 10     1     3    NA
# 11     2     4    NA
# 12     2     2    NA
# 13     2     1    NA
# 14     2     3    NA
# 15     2     1     1
# 16     2     4     4
# 17     2     2     4
# 18     2     2     0
# 19     2     3     4
# 20     2     1     4
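
To see why `row - lag(row) - 1` gives the count, here is a minimal sketch on a made-up five-row data frame. The names `toy`, `prev_row`, and `res` are illustrative only and not part of the original data. Within each (id, seq) group, `lag(row)` is the row position of the previous occurrence of the same value, so the difference minus one is the number of rows in between, and the first occurrence gets NA because it has no previous row.

```r
library(dplyr)

# Hypothetical toy data: a single id with five values
toy <- data.frame(id = 1L, seq = c(1L, 4L, 1L, 1L, 4L))

res <- toy %>%
  mutate(row = row_number()) %>%       # position in the whole data frame
  group_by(id, seq) %>%
  mutate(prev_row = lag(row),          # position of the previous equal value
         delay = row - prev_row - 1) %>%
  ungroup()

res$delay
# NA NA 1 0 2
```

The value 1 sits in rows 1, 3 and 4, giving NA, 1, 0; the value 4 sits in rows 2 and 5, giving NA, 2.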
Scarabee
  • here's an uncreative base translation: `df$delay <- ave(df$seq, df$id,FUN= function(x) ave(seq_along(x), x, FUN = function(y) y - c(NA, y[-length(y)]) -1))` – moodymudskipper Aug 09 '18 at 15:09

Here is a possibility using a custom function within a dplyr chain

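my.function <- function(x) {
    # Pre-allocate the result; NA means "no earlier equal entry"
    ret <- rep(NA_real_, length(x))
    # seq_along(x)[-1] is empty for vectors of length 0 or 1,
    # whereas 2:length(x) would count down and fail
    for (i in seq_along(x)[-1]) {
        # Scan backwards for the most recent entry equal to x[i]
        for (j in (i-1):1) {
            if (x[j] == x[i]) {
                ret[i] <- i - j - 1  # number of entries in between
                break
            }
        }
    }
    return(ret)
}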

library(dplyr)
df %>%
    group_by(id) %>%
    mutate(delay = my.function(seq))
## A tibble: 20 x 3
## Groups:   id [2]
#      id   seq delay
#   <int> <int> <dbl>
# 1     1     1   NA
# 2     1     4   NA
# 3     1     1    1.
# 4     1     1    0.
# 5     1     1    0.
# 6     1     2   NA
# 7     1     2    0.
# 8     1     2    0.
# 9     1     2    0.
#10     1     3   NA
#11     2     4   NA
#12     2     2   NA
#13     2     1   NA
#14     2     3   NA
#15     2     1    1.
#16     2     4    4.
#17     2     2    4.
#18     2     2    0.
#19     2     3    4.
#20     2     1    4.    

Some further explanations:

  1. We group rows by id and then apply my.function to entries in column seq. This ensures that we treat rows with different ids separately.

  2. my.function takes a vector of numeric entries, checks for previous equal entries, and returns the distance between the current and previous equal entry minus one (i.e. it counts the number of elements in between).

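  3. my.function uses two for loops, but this should be reasonably fast because we don't dynamically grow any vectors (ret is pre-allocated at the beginning of my.function) and we break the inner loop as soon as we encounter an equal element. Note that when a group contains few repeated values, the inner loop scans most of the earlier entries, so the cost grows quadratically with the group length.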
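
If the groups are long, the nested loops can be avoided altogether. The following is a minimal base R sketch under the same definition of the delay; `delay_vec` is a hypothetical name, not part of the answer above. It groups the positions by value with `ave()` and takes `diff()` of consecutive positions.

```r
# Hypothetical helper: delay to the previous equal entry, without nested loops
delay_vec <- function(x) {
  pos <- seq_along(x)
  # Within each distinct value, diff() of the positions is the distance to the
  # previous occurrence; subtract 1 to count only the entries in between.
  # The first occurrence of each value has no predecessor, hence the leading NA.
  ave(pos, x, FUN = function(p) c(NA, diff(p) - 1))
}

# The seq values for id 2 from the question
x <- c(4, 2, 1, 3, 1, 4, 2, 2, 3, 1)
delay_vec(x)
# NA NA NA NA 1 4 4 0 4 4
```

It can be used in the same place as my.function, e.g. `mutate(delay = delay_vec(seq))` after `group_by(id)`.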

Maurits Evers

Try:

set.seed(12)
id <- rep(1:2, 10)
seq <- sample(1:4, 20, replace = TRUE)
df <- data.frame(id,seq)
df <- df[order(df$id),]
df

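get_lead <- function(x) {
  x <- as.character(x)        # values are used as list names
  l <- list()                 # last seen position of each value
  res <- rep(NA, length(x))
  for (i in seq_along(x)) {
    if (!is.null(l[[x[i]]])) {
      # entries between position i and the previous occurrence of x[i]
      res[i] <- (i - l[[x[i]]] - 1)
    }
    l[[x[i]]] <- i            # remember the most recent position
  }
  res
}
# split() returns the groups ordered by id, so this assumes df is sorted by id
df$delay <- unlist(lapply(split(df$seq, df$id), get_lead))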
df  

# id seq delay
#1   1   1    NA
#3   1   4    NA
#5   1   1     1
#7   1   1     0
#9   1   1     0
#11  1   2    NA
#13  1   2     0
#15  1   2     0
#17  1   2     0
#19  1   3    NA
#2   2   4    NA
#4   2   2    NA
#6   2   1    NA
#8   2   3    NA
#10  2   1     1
#12  2   4     4
#14  2   2     4
#16  2   2     0
#18  2   3     4
#20  2   1     4
r.user.05apr
  • Thank you for your solution! This also worked, but as a relative beginner in R the first solution was a bit easier to understand. Thus, I accepted the first as my answer, though your function worked just as well :-) – Vanessa S. Aug 09 '18 at 12:54

Here is an approach:

  1. Write a function that finds the first row of the current id.

  2. Write a function that counts the differing values between the current row and the most recent row with the same seq value.

  3. Apply that function to every row and assign the result to a new variable, delay.

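# Row index at which the id of row j first appears (start of its group)
Indstart <- function(j, df) {
  min(which(df[1:j, 1] == df[j, 1]))
}

# Number of differing seq values between row j and the previous equal value
difval <- function(j, df) {
  i <- Indstart(j, df)
  if (j == i) return(NA_integer_)       # first row of an id has no predecessor
  prev <- which(df[i:(j - 1), 2] == df[j, 2])
  if (length(prev) == 0) return(NA_integer_)
  pos_j_pr <- max(prev) + i - 1         # row of the most recent equal value
  sum(df[pos_j_pr:j, 2] != df[j, 2])
}

# Assumes df is sorted by id, with id in column 1 and seq in column 2
df$delay <- NA_integer_
for (j in seq_len(nrow(df))) {
  df$delay[j] <- difval(j, df)
}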
Nar