Just say we run a study where participants are measured on some outcome variable four times each. At the start of testing they provide their age and sex. Here is some toy data to illustrate.
set.seed(1)
sex <- NA
age <- NA
df <- data.frame(id = factor(rep(1:4,each=4)),
time = rep(1:4,times=4),
sex = as.vector(sapply(0:3, function(i) sex[i*4 + 1:4] <- c(sample(c("m", "f"), 1, replace = T), rep(NA,3)))),
age = as.vector(sapply(0:3, function(i) age[i*4 + 1:4] <- c(sample(18:75, 1, replace = T), rep(NA,3)))),
outcome = round(rnorm(16),2),
stringsAsFactors = F)
Here is what the data looks like
df
# output
# id time sex age outcome
# 1 1 m 29 0.33
# 1 2 <NA> NA -0.82
# 1 3 <NA> NA 0.49
# 1 4 <NA> NA 0.74
# 2 1 m 70 0.58
# 2 2 <NA> NA -0.31
# 2 3 <NA> NA 1.51
# 2 4 <NA> NA 0.39
# 3 1 f 72 -0.62
# 3 2 <NA> NA -2.21
# 3 3 <NA> NA 1.12
# 3 4 <NA> NA -0.04
# 4 1 f 56 -0.02
# 4 2 <NA> NA 0.94
# 4 3 <NA> NA 0.82
# 4 4 <NA> NA 0.59
Now what I want to do is to use the tidyverse
to apply the values for the demographic variables, at present only on the first row of each participant's data, to all the rows.
At present all I could come up with was
df %>% group_by(id) %>% # group by id
distinct(sex) %>% # shrink to unique values for each id
dplyr::filter(!is.na(sex)) %>% # remove the NAs
left_join(df, by = "id")
Which yields the output
# A tibble: 16 x 6
# Groups: id [4]
# sex.x id time sex.y age outcome
# <chr> <fct> <int> <chr> <int> <dbl>
# 1 m 1 1 m 29 0.33
# 2 m 1 2 NA NA -0.82
# 3 m 1 3 NA NA 0.49
# 4 m 1 4 NA NA 0.74
# 5 m 2 1 m 70 0.580
# 6 m 2 2 NA NA -0.31
# 7 m 2 3 NA NA 1.51
# 8 m 2 4 NA NA 0.39
# 9 f 3 1 f 72 -0.62
# 10 f 3 2 NA NA -2.21
# 11 f 3 3 NA NA 1.12
# 12 f 3 4 NA NA -0.04
# 13 f 4 1 f 56 -0.02
# 14 f 4 2 NA NA 0.94
# 15 f 4 3 NA NA 0.82
# 16 f 4 4 NA NA 0.59
Now I would consider this partially successful because the first row in each participant's sex.x
column has now been applied to all their other rows, but I really don't like that there are now two sex columns.
Now I could easily add some more functions to the chain that remove the superfluous sex.y
column and rename the sex.x
column to its original form, but this seems a bit clunky.
Can anyone suggest how to do this better?