Data with multiple rows per observations with variables populated in some but not other rows

Question

So I have this data frame:

dat1 <- data.frame(id=1:n, 
                  group=rep(LETTERS[1:2], n/2),
                  age=sample(18:30, n, replace=TRUE),
                  type=NA,
                  op=factor(paste0("op", 1:n)),
                  x=NA)
dat1
dat2 <- data.frame(id=1:n, 
                   group=rep(LETTERS[1:2], n/2),
                   age=NA,
                   type=factor(paste0("type", 1:n)),
                   op=NA, 
                   x=rnorm(n))
dat <- full_join(dat1,dat2) %>% arrange(id)
dat

   id group age  type   op           x
1   1     A  19  <NA>  op1          NA
2   1     A  NA type1 <NA>  0.18819303
3   2     B  29  <NA>  op2          NA
4   2     B  NA type2 <NA>  0.11916096
5   3     A  19  <NA>  op3          NA
6   3     A  NA type3 <NA> -0.02509255
7   4     B  28  <NA>  op4          NA
8   4     B  NA type4 <NA>  0.10807273
9   5     A  27  <NA>  op5          NA
10  5     A  NA type5 <NA> -0.48543524
11  6     B  26  <NA>  op6          NA
12  6     B  NA type6 <NA> -0.50421713

This dataset has multiple rows per observation, while some variables are stored in one row and some in another. I like to have it in a tidy format with only one row per id. I can filter into two data frames then rejoin, but there must be an easier way. Summarise would work but only with numeric variables.. The group is fixed with id, there is no same id in different groups

  group_by(id) %>% 
  summarise_if(is.numeric, sum, na.rm=T)
data

It seems there must be a very easy solution but I can't figure it out. Thanks for help!

Not sure if this should be marked as duplicate to this: https://stackoverflow.com/questions/28036294/collapsing-rows-where-some-are-all-na-others-are-disjoint-with-some-nas . As that question is for numeric example only, and some of the answers deal with that case only. But this question requires a general case. — Karl, Nov 01 '21 at 09:26

Karl · Accepted Answer · 2023-04-15T14:24:33.770

1

From this answer by tmfmnk to a similar question, you should be able to solve this by adding this code block at the end:

dat <- group_by(dat, id) %>%
  summarize(across(everything(), ~ first(na.omit(.x))))

dat

The question in the link is a special case for only numbers, but this code block should work in general.

edited Apr 15 '23 at 14:24

answered Nov 01 '21 at 09:24

Karl

1,074
8
25

Data with multiple rows per observations with variables populated in some but not other rows

1 Answers1