1

I'd like to calculate relative changes of measured variables in a data.frame by group with dplyr. The changes are with respect to a first baseline value at time==0.

I can easily do this in the following example:

 # with this easy example it works 
 df.easy <- data.frame( id  =c(1,1,1,2,2,2)
                   ,time=c(0,1,2,0,1,2)
                   ,meas=c(5,6,9,4,5,6))

 df.easy %>% dplyr::group_by(id) %>% dplyr::mutate(meas.relative =
 meas/meas[time==0])
     # Source: local data frame [6 x 4]
     # Groups: id [2]
     # 
     #      id  time  meas meas.relative
     #   <dbl> <dbl> <dbl>         <dbl>
     # 1     1     0     5          1.00
     # 2     1     1     6          1.20
     # 3     1     2     9          1.80
     # 4     2     0     4          1.00
     # 5     2     1     5          1.25
     # 6     2     2     6          1.50

However, when there are id's with no measuremnt at time==0, this doesn't work. A similar question is this, but I'd like to get an NA as a result instead of simply taking the first occurence as baseline.

 # how to output NA in case there are id's with no measurement at time==0?
 df <- data.frame( id  =c(1,1,1,2,2,2,3,3)
                  ,time=c(0,1,2,0,1,2,1,2)
                  ,meas=c(5,6,9,4,5,6,5,6))

 # same approach now gives an error:
     df %>% dplyr::group_by(id) %>% dplyr::mutate(meas.relative = meas/meas[time==0])
     # Error in mutate_impl(.data, dots) : 
     #   incompatible size (0), expecting 2 (the group size) or 1

Let's try to return NA in case no measurement at time==0 was taken, using ifelse

 df %>% dplyr::group_by(id) %>% dplyr::mutate(meas.relative = ifelse(any(time==0), meas/meas[time==0], NA) )
     # Source: local data frame [8 x 4]
     # Groups: id [3]
     # 
     #      id  time  meas meas.relative
     #   <dbl> <dbl> <dbl>         <dbl>
     # 1     1     0     5             1
     # 2     1     1     6             1
     # 3     1     2     9             1
     # 4     2     0     4             1
     # 5     2     1     5             1
     # 6     2     2     6             1
     # 7     3     1     5            NA
     # 8     3     2     6            NA>

Wait, why is above the relative measurement 1?

identical(
    df %>% dplyr::group_by(id) %>% dplyr::mutate(meas.relative = ifelse(any(time==0), meas, NA) ),
    df %>% dplyr::group_by(id) %>% dplyr::mutate(meas.relative = ifelse(any(time==0), meas[time==0], NA) )
    )
    # TRUE

It seems that the ifelse prevents meas to pick the current line, but selects always the subset where time==0.

How can I calculate relative changes when there are IDs with no baseline measurement?

Community
  • 1
  • 1
fabern
  • 318
  • 2
  • 10
  • A workaround is the following two-step solution `df <- df %>% dplyr::group_by(id) %>% dplyr::mutate(meas.baseline = ifelse(any(time==0), meas[time==0],NA))` and `dplyr::mutate(df, meas.relative = meas/meas.baseline)`. – fabern Apr 12 '17 at 18:14
  • 2
    In data.table, I believe you can do this with `setDT(df)[, meas.rel := (meas) / meas[time==0], by=id]` though I'm not 100% sure of your desired output. – lmo Apr 12 '17 at 18:17
  • Thanks for this reply @Imo. However, I haven't tested it, as I don't have data.table on my work machine. Does this return `NA` in case `meas[time==0]` does not exist for a group? – fabern Apr 13 '17 at 09:43
  • 1
    Yes. The last two observations in the example are NA and the ratios are calculated for the other values. – lmo Apr 13 '17 at 11:30

1 Answers1

1

Your issue was in the ifelse(). According to the ifelse documentation it returns "A vector of the same length...as test". Since any(time==0) is of length 1 for each group (TRUE or FALSE) only the first observation of the meas / meas[time==0] was being selected. This was then repeated to fill each group.

To fix this all I did was rep the any() to be the length of the group. I believe this should work:

df %>% dplyr::group_by(id) %>% 
       dplyr::mutate(meas.relative = ifelse(rep(any(time==0),times = n()), meas/meas[time==0], NA) )

  #       id  time  meas meas.relative
  #    <dbl> <dbl> <dbl>         <dbl>
  #  1     1     0     5          1.00
  #  2     1     1     6          1.20
  #  3     1     2     9          1.80
  #  4     2     0     4          1.00
  #  5     2     1     5          1.25
  #  6     2     2     6          1.50
  #  7     3     1     5            NA
  #  8     3     2     6            NA

To see how this was working incorrectly in your case try:

ifelse(TRUE,c(1,2,3),NA)
#[1] 1

Edit: A data.table solution with the same concept:

as.data.table(df)[, meas.rel := ifelse(rep(any(time==0), .N), meas/meas[time==0], NA_real_)
                  ,by=id]
Mike H.
  • 13,960
  • 2
  • 29
  • 39