Including missing values in summarise output

Question

I am trying to still keep all rows in a summarise output even when one of the columns does not exist. I have a data frame that looks like this:

dat <- data.frame(id=c(1,1,2,2,2,3),
                  seq_num=c(0:1,0:2,0:0),
                  time=c(4,5,6,7,8,9))

I then need to summarize by all ids, where id is a row and there is a column for the first seq_num and second one. Even if the second one doesn't exist, I'd still like that row to be maintained, with an NA in that slot. I've tried the answers in this answer, but they are not working.

dat %>% 
  group_by(id, .drop=FALSE) %>% 
  summarise(seq_0_time = time[seq_num==0],
            seq_1_time = time[seq_num==1])

outputs

     id seq_0_time seq_1_time
  <dbl>      <dbl>      <dbl>
1     1          4          5
2     2          6          7

I would still like a 3rd row, though, with seq_0_time=9, and seq_1_time=NA since it doesn't exist.

How can I do this?

score 2 · Accepted Answer · answered Apr 21 '22 at 16:05

If there are only max one observation per 'seq_num' for each 'id', then it is possible to coerce to NA where there are no cases with [1]

library(dplyr)
dat %>% 
  group_by(id) %>% 
  summarise(seq_0_time = time[seq_num ==0][1],
            seq_1_time = time[seq_num == 1][1], .groups = 'drop')

-output

# A tibble: 3 × 3
     id seq_0_time seq_1_time
  <dbl>      <dbl>      <dbl>
1     1          4          5
2     2          6          7
3     3          9         NA

It is just that the length of 0 can be modified to length 1 by assigning NA Or similarly this can be used to replicate NA to fill for 2, 3, etc, by specifying the index that didn't occur

> with(dat, time[seq_num==1 & id == 3])
numeric(0)
> with(dat, time[seq_num==1 & id == 3][1])
[1] NA
> numeric(0)
numeric(0)
> numeric(0)[1]
[1] NA
> numeric(0)[1:2]
[1] NA NA

Or using length<-

> `length<-`(numeric(0), 3)
[1] NA NA NA

Nick Camarda · Answer 2 · 2022-04-21T01:19:43.513

-1

My understanding is that you must use complete() on both the seq_num and id variables to achieve your desired result:

library(tidyverse)
dat <- data.frame(id=c(1,1,2,2,2,3),
                  seq_num=c(0:1,0:2,0:0),
                  time=c(4,5,6,7,8,9)) %>%
  complete(seq_num = seq_num,
           id = id)
dat %>% 
  group_by(id, .drop=FALSE) %>% 
  summarise(seq_0_time = time[seq_num==0],
            seq_1_time = time[seq_num==1])
#> # A tibble: 3 x 3
#>      id seq_0_time seq_1_time
#>   <dbl>      <dbl>      <dbl>
#> 1     1          4          5
#> 2     2          6          7
#> 3     3          9         NA

^{Created on 2022-04-20 by the reprex package (v2.0.1)}

edited Apr 21 '22 at 01:19

answered Apr 21 '22 at 00:53

Nick Camarda

310
1
7

I’m not sure I follow. I don’t want to modify the original data. And the data frame would get very sparse because some sequences are 1 unit, while others might be 50. – Adam_G Apr 21 '22 at 00:56
the original dataset doesn't contain all combinations of `id` and `seq_num`, so that's why you're missing that last row. You need to have all combinations present (maybe in a new dataframe, so that the original isn't modified) so that you can get the result you've asked for. Maybe I'm misunderstanding your question though? Could you rephrase more explicitly then? – Nick Camarda Apr 21 '22 at 00:59

score -1 · Answer 3 · answered Apr 21 '22 at 01:16

-1

This can actually be pretty easily solved using reshape.

> reshape(dat, timevar='seq_num', idvar = 'id', direction = 'wide')
  id time.0 time.1 time.2
1  1      4      5     NA
3  2      6      7      8
6  3      9     NA     NA

answered Apr 21 '22 at 01:16

Adam_G

7,337
20
86
148

1

if you're going in this direction, you could also use pivot_wider() to achieve the same effect – Nick Camarda Apr 21 '22 at 01:20

Including missing values in summarise output

3 Answers3