1

Just say we run a study where participants are measured on some outcome variable four times each. At the start of testing they provide their age and sex. Here is some toy data to illustrate.

set.seed(1)
sex <- NA
age <- NA
df <- data.frame(id = factor(rep(1:4,each=4)),
                 time = rep(1:4,times=4),
                 sex = as.vector(sapply(0:3, function(i) sex[i*4 + 1:4] <- c(sample(c("m", "f"), 1, replace = T), rep(NA,3)))),
                 age = as.vector(sapply(0:3, function(i) age[i*4 + 1:4] <- c(sample(18:75, 1, replace = T), rep(NA,3)))),
                 outcome = round(rnorm(16),2),
                 stringsAsFactors = F)

Here is what the data looks like

df

# output
# id time  sex age outcome
#  1    1    m  29    0.33
#  1    2 <NA>  NA   -0.82
#  1    3 <NA>  NA    0.49
#  1    4 <NA>  NA    0.74
#  2    1    m  70    0.58
#  2    2 <NA>  NA   -0.31
#  2    3 <NA>  NA    1.51
#  2    4 <NA>  NA    0.39
#  3    1    f  72   -0.62
#  3    2 <NA>  NA   -2.21
#  3    3 <NA>  NA    1.12
#  3    4 <NA>  NA   -0.04
#  4    1    f  56   -0.02
#  4    2 <NA>  NA    0.94
#  4    3 <NA>  NA    0.82
#  4    4 <NA>  NA    0.59

Now what I want to do is to use the tidyverse to apply the values for the demographic variables, at present only on the first row of each participant's data, to all the rows.

At present all I could come up with was

df %>% group_by(id) %>% # group by id
       distinct(sex) %>% # shrink to unique values for each id
       dplyr::filter(!is.na(sex)) %>% # remove the NAs
       left_join(df, by = "id")

Which yields the output

# A tibble: 16 x 6
# Groups:   id [4]
#    sex.x id     time sex.y   age outcome
#    <chr> <fct> <int> <chr> <int>   <dbl>
#  1 m     1         1 m        29   0.33 
#  2 m     1         2 NA       NA  -0.82 
#  3 m     1         3 NA       NA   0.49 
#  4 m     1         4 NA       NA   0.74 
#  5 m     2         1 m        70   0.580
#  6 m     2         2 NA       NA  -0.31 
#  7 m     2         3 NA       NA   1.51 
#  8 m     2         4 NA       NA   0.39 
#  9 f     3         1 f        72  -0.62 
# 10 f     3         2 NA       NA  -2.21 
# 11 f     3         3 NA       NA   1.12 
# 12 f     3         4 NA       NA  -0.04 
# 13 f     4         1 f        56  -0.02 
# 14 f     4         2 NA       NA   0.94 
# 15 f     4         3 NA       NA   0.82 
# 16 f     4         4 NA       NA   0.59 

Now I would consider this partially successful because the first row in each participant's sex.x column has now been applied to all their other rows, but I really don't like that there are now two sex columns.

Now I could easily add some more functions to the chain that remove the superfluous sex.y column and rename the sex.x column to its original form, but this seems a bit clunky.

Can anyone suggest how to do this better?

llewmills
  • 2,959
  • 3
  • 31
  • 58
  • Does this answer your question? [Replacing NAs with latest non-NA value](https://stackoverflow.com/questions/7735647/replacing-nas-with-latest-non-na-value). This is a better dupe: https://stackoverflow.com/questions/40040834/replace-na-with-previous-or-next-value-by-group-using-dplyr. – Ritchie Sacramento Jun 26 '20 at 00:33
  • Yes the second one answers the same question the same way. – llewmills Jun 26 '20 at 00:44
  • Maybe your previous step(s) needs a make-over. How is the data generated? If your data are separated into two data frames (one for baseline data and one for follow-up data), merging them together will give you your desired output, and no need to `fill` anything. – Edward Jun 26 '20 at 01:12
  • Sadly @Edward they were all part of the same extract from a redcap database. I could have extracted the datasets separately and done it that way no question. This question was more for my edification than anything else. – llewmills Jun 26 '20 at 03:04

1 Answers1

0

You can fill the sex value for each id :

library(dplyr)
df %>% group_by(id) %>% tidyr::fill(sex)

#   id     time sex     age outcome
#   <fct> <int> <chr> <int>   <dbl>
# 1 1         1 m        51  -1.54 
# 2 1         2 m        NA  -0.93 
# 3 1         3 m        NA  -0.290
# 4 1         4 m        NA  -0.01 
# 5 2         1 f        40   2.4  
# 6 2         2 f        NA   0.76 
# 7 2         3 f        NA  -0.8  
# 8 2         4 f        NA  -1.15 
# 9 3         1 m        60  -0.290
#10 3         2 m        NA  -0.3  
#11 3         3 m        NA  -0.41 
#12 3         4 m        NA   0.25 
#13 4         1 m        31  -0.89 
#14 4         2 m        NA   0.44 
#15 4         3 m        NA  -1.24 
#16 4         4 m        NA  -0.22 

You could also fill age value.(df %>% group_by(id) %>% tidyr::fill(sex, age)).

PS - I get different numbers from the same seed value though.

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213