1

Someone asked this already in a simpler version here, but I cannot quite get it to work for my case.

I have observational data on a number of individuals across multiple years for a set of questions, but not everyone is asked every question every year. I want to generate a new dataframe that has the most recent answer for each individual.

The data looks like this:

df <- data.frame(individual = c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"), time = c(1:4), questionA = c("Yes", NA, "No", NA, "No", NA, "No", "Yes", "No", NA, NA, "No"), questionB = c(3, 5, 4, 5, 8, 6, 7, 4, 3, 1, 5, NA)) 

The resulting dataframe for this example should look like this:

most_recent <- data.frame(individual = c("A", "B", "C"), questionA = c("No", "Yes", "No"), questionB = c(5, 4, 5))

Ideally I am looking for a dplyr solution. Thank you!

3 Answers3

2

We can use dplyr's across() for this:

df %>%
  group_by(individual) %>%
  summarize(across(starts_with("question"), ~ last(na.omit(.))))
# # A tibble: 3 x 3
#   individual questionA questionB
#   <chr>      <chr>         <dbl>
# 1 A          No                5
# 2 B          Yes               4
# 3 C          No                5
Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
  • I am dealing with a large number of variables, so combining this across() approach with the answer by Waldi above will be immensely helpful! Thank you! – Dominic Gohla May 12 '21 at 16:57
  • Frankly, I don't know why you'd combine approaches. `last(question)` is a shorter, dplyr-native way to do `tail(question, 1)`, and `na.omit(question)` is a shorter way to do `question[!is.na(question)]`. – Gregor Thomas May 12 '21 at 17:04
  • 1
    Discovered the source of my error and fixed it - thank you for your help! – Dominic Gohla May 12 '21 at 17:34
0

My take in base R, it filters the df by the most recent time of each person

df <- data.frame(individual = c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"), 
                 time = c(1:4), 
                 questionA = c("Yes", NA, NA, "No", "No", NA, NA, "Yes", "No", NA, NA, "No"), 
                 questionB = c(3, 5, 4, 5, 8, 6, 7, 4, 3, 1, 3, 5),stringsAsFactors = F) 

#new column to use with %in% 
df$match <- paste(df$individual, df$time)

#find the most recent sample for each individual
id <- unique(df$individual)
most_recent <- sapply(id, function(id){
  time <- max(df$time[df$individual == id])
  return(paste(id,time))
})
#filter df by most recent
final <- df[df$match %in% most_recent,]

final

   individual time questionA questionB match
4           A    4        No         5   A 4
8           B    4       Yes         4   B 4
12          C    4        No         5   C 4
0

We could use slice_tail after filling the 'question' NA with the adjacent non-NA, grouped and ordered by 'individual', 'time' columns

library(dplyr)
library(tidyr)
df %>% 
  arrange(individual, time) %>%
  select(-time) %>%
  group_by(individual) %>% 
  fill(starts_with('question')) %>% 
  slice_tail(n = 1) %>%
  ungroup

-output

# A tibble: 3 x 3
#  individual questionA questionB
#  <chr>      <chr>         <dbl>
#1 A          No                5
#2 B          Yes               4
#3 C          No                5
akrun
  • 874,273
  • 37
  • 540
  • 662