I have a simple dataframe as such:

ID    Col1    Col2    Col3    Col4
1     NA      NA      NA      NA  
1     5       10      NA      NA
1     NA      NA      15      20
2     NA      NA      NA      NA  
2     25      30      NA      NA
2     NA      NA      35      40 

And I would like to reformat it as such:

ID    Col1    Col2    Col3    Col4
1     5       10      15      20
2     25      30      35      40

(please note: the real data set has thousands of rows and the values are from biological data -- the NAs follow no simple pattern, except that the NAs are disjoint, and yes there are exactly 3 rows for each ID).

STEP ONE: get rid of rows that have only NA values.

On the surface this looked simple, but I've run across some problems.

complete.cases(DF) returns all FALSE, so I can't really use this to remove the rows with all NAs, as in DF[complete.cases(DF),]. This is because all rows contain at least one NA.

Since NAs want to propagate themselves, other schemes using is.na fail for the same reason.
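For step one on its own, one workaround is to count non-NA values per row instead of relying on complete.cases(). A minimal sketch, assuming the data frame is named DF with ID in the first column (DF below is reconstructed from the example in the question):

```r
# Reconstruction of the example data from the question
DF <- data.frame(ID   = c(1, 1, 1, 2, 2, 2),
                 Col1 = c(NA, 5, NA, NA, 25, NA),
                 Col2 = c(NA, 10, NA, NA, 30, NA),
                 Col3 = c(NA, NA, 15, NA, NA, 35),
                 Col4 = c(NA, NA, 20, NA, NA, 40))

# Keep only rows where at least one non-ID column is not NA;
# rowSums(!is.na(...)) counts the non-NA entries per row
DF_kept <- DF[rowSums(!is.na(DF[-1])) > 0, ]
DF_kept
#   ID Col1 Col2 Col3 Col4
# 2  1    5   10   NA   NA
# 3  1   NA   NA   15   20
# 5  2   25   30   NA   NA
# 6  2   NA   NA   35   40
```

Unlike complete.cases(), this only asks whether a row is *entirely* NA (outside the ID column), so rows with a mix of values and NAs survive.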

STEP TWO: collapse the remaining two rows into one.

Thinking about using something like aggregate to pull this off, but there has got to be an easier way than this, which doesn't work at all.
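Because the non-NA values are disjoint within each ID, both steps can also be done in one base-R call: rowsum() sums each column by group, and summing with na.rm = TRUE over disjoint values simply picks out the lone non-NA entry per cell. A sketch, assuming the same reconstructed DF as in the question:

```r
# Reconstruction of the example data from the question
DF <- data.frame(ID   = c(1, 1, 1, 2, 2, 2),
                 Col1 = c(NA, 5, NA, NA, 25, NA),
                 Col2 = c(NA, 10, NA, NA, 30, NA),
                 Col3 = c(NA, NA, 15, NA, NA, 35),
                 Col4 = c(NA, NA, 20, NA, NA, 40))

# Sum each column within each ID, ignoring NAs; with disjoint NAs
# the "sum" is just the single non-NA value
collapsed <- rowsum(DF[-1], DF$ID, na.rm = TRUE)
collapsed
#   Col1 Col2 Col3 Col4
# 1    5   10   15   20
# 2   25   30   35   40
```

One caveat: like any sum-based approach, a column that is all NA for some ID collapses to 0 rather than NA, so this is only safe when every ID has at least one value per column.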

Thanks for any advice.

tumultous_rooster

5 Answers

Here's a data.table approach that applies na.omit() to each column, grouped by ID.

library(data.table)
setDT(DF)[, lapply(.SD, na.omit), by = ID]
#    ID Col1 Col2 Col3 Col4
# 1:  1    5   10   15   20
# 2:  2   25   30   35   40
Rich Scriven

Try

library(dplyr)
DF %>% group_by(ID) %>% summarise_each(funs(sum(., na.rm = TRUE))) 

Edit: to handle the case in which a column is all NA for a certain ID, we need a sum_NA() function that returns NA when every value is NA. In the data below, Col1 for ID = 2 has been changed to all NAs to illustrate.

txt <- "ID    Col1    Col2    Col3    Col4
        1     NA      NA      NA      NA
        1     5       10      NA      NA
        1     NA      NA      15      20
        2     NA      NA      NA      NA
        2     NA      30      NA      NA
        2     NA      NA      35      40"
DF <- read.table(text = txt, header = TRUE)

# original code
DF %>% 
  group_by(ID) %>% 
  summarise_each(funs(sum(., na.rm = TRUE)))

# `summarise_each()` is deprecated.
# Use `summarise_all()`, `summarise_at()` or `summarise_if()` instead.
# To map `funs` over all variables, use `summarise_all()`
# A tibble: 2 x 5
     ID  Col1  Col2  Col3  Col4
  <int> <int> <int> <int> <int>
1     1     5    10    15    20
2     2     0    30    35    40

sum_NA <- function(x) {if (all(is.na(x))) x[NA_integer_] else sum(x, na.rm = TRUE)}

DF %>%
  group_by(ID) %>%
  summarise_all(funs(sum_NA))

DF %>%
  group_by(ID) %>%
  summarise_if(is.numeric, funs(sum_NA))

# A tibble: 2 x 5
     ID  Col1  Col2  Col3  Col4
  <int> <int> <int> <int> <int>
1     1     5    10    15    20
2     2    NA    30    35    40
Khashaa
    I think `summarise_each(funs(na.omit(.)) )` also works – akrun Jan 20 '15 at 03:41
  • Yes, but I think that would yield a little dissonant result if any of the columns had more than 2 non-NA observations. – Khashaa Jan 20 '15 at 03:54
    This solution doesn't work if any column has all `NA`s for a certain `ID`. For example replace `25` by `NA` for `ID = 2` in `Col1`. Whereas `funs(na.omit(.))` will throw error `Error: Column Col1 must be length 1 (a summary value), not 0` – Tung Sep 10 '18 at 18:33

Here are a couple of aggregate approaches:

aggregate(. ~ ID, data=dat, FUN=na.omit, na.action="na.pass")
#  ID Col1 Col2 Col3 Col4
#1  1    5   10   15   20
#2  2   25   30   35   40

Since aggregate's formula interface by default uses na.omit on the entire data before doing any grouping, it will delete every row of dat as they all contain at least one NA value. Try it: nrow(na.omit(dat)) returns 0. So in this case, use na.pass in aggregate and then na.omit to skip over the NAs that were passed through.

Alternatively, don't use the formula interface and specify the columns to aggregate manually:

aggregate(dat[-1], dat[1], FUN=na.omit)
aggregate(dat[c("Col1","Col2","Col3","Col4")], dat["ID"], FUN=na.omit)
#  ID Col1 Col2 Col3 Col4
#1  1    5   10   15   20
#2  2   25   30   35   40
thelatemail

Since dplyr 1.0.0, you can also do (using the data provided by @Khashaa):

df %>% 
 group_by(ID) %>%
 summarize(across(everything(), ~ first(na.omit(.))))

     ID  Col1  Col2  Col3  Col4
  <int> <int> <int> <int> <int>
1     1     5    10    15    20
2     2    NA    30    35    40
tmfmnk

The simple way is:

as.data.frame(lapply(myData[, c('Col1','Col2','Col3','Col4')], function(x) x[!is.na(x)]))

but if not all columns have the same number of non-NA values, then you'll need to trim them to a common length, like so:

temp <- lapply(myData[, c('Col1','Col2','Col3','Col4')], function(x) x[!is.na(x)])
len  <- min(sapply(temp, length))
as.data.frame(lapply(temp, `[`, seq(len)))
Jthorpe