2

I am trying to create a new variable that is basically the starting value of another variable in my dataframe. Example data:

id <- rep(c(1, 2), each = 8)
outcome <- rep(1:5, length.out = 16)
time <- rep(c(0, 1, 3, 4),4)
Attitude <- rep(c('A1', 'A2', 'A1', 'A2'), each = 4)
df <- data.frame(id, Attitude, outcome, time)

What I'd like to get is a new column named new_var (or whatever) that is equal to the value of outcome at time == 0 for id = id and also depends on Attitude. Thus what I'd like to extend the dataframe to is:

df$new_var <- c(1,1,1,1,5,5,5,5,4,4,4,4,3,3,3,3)

Only then with some decent coding. In SAS I know I can do this with the lag function. I would really appreciate a solution that isn't a 'work around' so it is like SAS, but rather the proper r solution. In the end I want to get stronger in r too.

Related: Retain and lag function in R as SAS However I prefer some solution that is based on indices or the 'usual' r way. And here it's also not dependent on other conditions.

So, important here is that the coding works for the different ids, attitude levels / variables (A1, A2, ...) and that the outcome value at time == 0 is basically copied to new_var.

I hope I am clear in conveying my message. If not I think the small piece of example code and how I'd like to extend it should be clear enough. Looking forward to suggestions.

EDIT Another example code for @jogo answer.

ID <- rep(1, 36)
Attitude <- rep(c('A1', 'A2','A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9'), 
length.out =36)
Answer_a <- rep(1:5, length.out = 36)
time <- as.character(rep(c(0, 1, 3, 4), each = 9))

df <- data.frame(ID, Attitude, Answer_a, time)
df$time <- as.character(df$time)
Anonymous
  • 502
  • 4
  • 23

2 Answers2

1

I think this is what you mean - assuming the data is always in the correct order?

EDIT Added an arrange step to ensure the data is always correctly ordered.

        library(tidyverse)
        df %>% group_by(id, Attitude) %>% 
               arrange(time) %>% 
               mutate(new_var2 = first(outcome[!is.na(outcome)])

        # A tibble: 16 x 6
        # Groups:   id, Attitude [4]
              id Attitude outcome  time new_var new_var2
           <dbl> <fct>      <int> <dbl>   <dbl>    <int>
         1  1.00 A1             1  0       1.00        1
         2  1.00 A1             2  1.00    1.00        1
         3  1.00 A1             3  3.00    1.00        1
         4  1.00 A1             4  4.00    1.00        1
         5  1.00 A2             5  0       5.00        5
         6  1.00 A2             1  1.00    5.00        5
         7  1.00 A2             2  3.00    5.00        5
         8  1.00 A2             3  4.00    5.00        5
         9  2.00 A1             4  0       4.00        4
        10  2.00 A1             5  1.00    4.00        4
        11  2.00 A1             1  3.00    4.00        4
        12  2.00 A1             2  4.00    4.00        4
        13  2.00 A2             3  0       3.00        3
        14  2.00 A2             4  1.00    3.00        3
        15  2.00 A2             5  3.00    3.00        3
        16  2.00 A2             1  4.00    3.00        3
Stephen Henderson
  • 6,340
  • 3
  • 27
  • 33
  • Awesome, this seems to work! One question though, just to be sure. By 'assuming the data is always in the correct order', you mean it's properly ordered by `time` and `id` right? Or does it also depend on how `Attitude` is ordered? Will do a more thorough check asap as to whether it indeed works, but it seems to. – Anonymous Feb 23 '18 at 23:04
  • it uses the order as shown to pick the `first(outcome)`. You can make sure that you data is in the correct order by adding `%>% arrange(time) %>%` between the `group_by` and `mutate` – Stephen Henderson Feb 23 '18 at 23:09
  • Thanks for your help. Seems like a package worth exploring. – Anonymous Feb 23 '18 at 23:50
  • Which package is it exactly that you used? Trying to get some more info. Edit: I realize you loaded tidyverse, but I'm wondering which one of those (as there seem to be a lot that get installed with it). – Anonymous Feb 24 '18 at 12:58
  • All `dplyr` function – Stephen Henderson Feb 24 '18 at 18:55
  • I noticed that in a few instances I get an NA value in my original dataset, as some `outcomes` at `time = 0` are `NA`. Is there a quick way to say it should pick the second value (or next value) instead? – Anonymous Mar 01 '18 at 12:16
  • I've edited it to use the first non NA value. If they are all NA you will still get NA – Stephen Henderson Mar 01 '18 at 13:02
0

Here is a solution with data.table:

library("data.table")
setDT(df)
df[, new_var:=outcome[1], rleid(Attitude)][]  # or
# df[, new_var:=outcome[time==0], rleid(Attitude)][]

For testing I named the new column new_var2:

id <- rep(c(1, 2), each = 8)
outcome <- rep(1:5, length.out = 16)
time <- rep(c(0, 1, 3, 4),4)
Attitude <- rep(c('A1', 'A2', 'A1', 'A2'), each = 4)
df <- data.frame(id, Attitude, outcome, time)
df$new_var <- c(1,1,1,1,5,5,5,5,4,4,4,4,3,3,3,3)

library("data.table")
setDT(df)
df[, new_var2:=outcome[1], rleid(Attitude)][] 
# > df[, new_var2:=outcome[1], rleid(Attitude)][]
#     id Attitude outcome time new_var new_var2
#  1:  1       A1       1    0       1        1
#  2:  1       A1       2    1       1        1
#  3:  1       A1       3    3       1        1
#  4:  1       A1       4    4       1        1
#  5:  1       A2       5    0       5        5
#  6:  1       A2       1    1       5        5
#  7:  1       A2       2    3       5        5
#  8:  1       A2       3    4       5        5
#  9:  2       A1       4    0       4        4
# 10:  2       A1       5    1       4        4
# 11:  2       A1       1    3       4        4
# 12:  2       A1       2    4       4        4
# 13:  2       A2       3    0       3        3
# 14:  2       A2       4    1       3        3
# 15:  2       A2       5    3       3        3
# 16:  2       A2       1    4       3        3

Your second example shows that you have to reorder the rows of the data. Usinf data.table this can be done by setkey():

ID <- rep(1, 36)
Attitude <- rep(c('A1', 'A2','A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9'), 
                length.out =36)
Answer_a <- rep(1:5, length.out = 36)
time <- as.character(rep(c(0, 1, 3, 4), each = 9))

df <- data.frame(ID, Attitude, Answer_a, time)
df$time <- as.character(df$time)

library("data.table")
setDT(df)
setkey(df, ID, Attitude, time)
df[, new_var:=Answer_a[1], rleid(Attitude)]
df
jogo
  • 12,469
  • 11
  • 37
  • 42
  • For some reason this doesn't seem to work on the dataset I am working with. It simply copies the answer at time 0, then copies it at time 1, etc. So it doesn't fix `new_var's` value at that of `outcome` at `time == 0` for `time ==1` etc. Got an idea why this is happening? – Anonymous Feb 23 '18 at 22:40
  • I edited the answer - included the testing of the code with your data from the question. If you have data for which this code is not working, please give an example in your question. Then I can modify the code. – jogo Feb 24 '18 at 07:08
  • Sorry for the late reply, see the edit in the question for an example. Note that I only have 1 ID here, whereas in the real dataset there are many more. But I don't think that matters. – Anonymous Feb 26 '18 at 10:44
  • That is a matter of reorder the data (similar to the other answer). I will edit my answer. – jogo Feb 26 '18 at 10:57