1

I'm having trouble to exclude missing values in summarise_all function. I have a dataset (df) as shown below and basically I'm having two problems:

  1. excluding missing values and the output only being one number
  2. additional data rows with same IDs but NA values (the second column with 'TRUE' values in df1 dataset)

df1 dataset is the one I'm trying to get to.

Here's the whole enchilada:

df #the original dataset
  ID  type of data   genes1  genes2   genes3  ...
  1   new            2        NA      NA
  1   old            NA       0       NA
  1   suggested      NA       NA      2
  2   new            1        NA      NA
  2   old            NA       1       NA
  2   suggested      NA       NA      1
  ...
  
df1 <- df %>% group_by(df$ID) %>% summarize_all(list, na.rm= TRUE) #my code

#output 
  ID  type of data                genes1      genes2      genes3  ...
  1   c("new","old","suggested")  c(2,NA,NA)  c(0,NA,NA)  c(2,NA,NA)
  1  TRUE                         TRUE        TRUE        TRUE         
  2  c("new","old","suggested")   c(1,NA,NA)  c(1,NA,NA)  c(1,NA,NA)
  2  TRUE                         TRUE        TRUE        TRUE           
  ...

#my main concern is the "genes" type of data and the rows with same IDs and NA values, I wanted something like this  
df1 #dream dataset
  ID  type of data      genes1    genes2    genes3  ...
  1  #doesn't matter    2         0         2
  2  #doesn't matter    1         1         1
  ...


  
  

I also tried using na.omit in summarise_all but it didn't really fix anything.

Does anybody have any ideas on how to fix it?

371.
  • 23
  • 5

5 Answers5

1

You could do:

library(dplyr)

df %>%
  group_by(ID) %>%
  summarise(across(starts_with('genes'), ~.[!is.na(.)]))

#> # A tibble: 2 × 4
#>      ID genes1 genes2 genes3
#>   <dbl>  <dbl>  <dbl>  <dbl>
#> 1     1      2      0      2
#> 2     2      1      1      1

Aron Strandberg
  • 3,040
  • 9
  • 15
1

Another way

library (dplyr)

df[-2] |> 
  group_by(ID) |> 
  fill(genes1:genes3, .direction = "downup") |> 
  slice(1)

     ID genes1 genes2 genes3
  <int>  <int>  <int>  <int>
1     1      2      0      2
2     2      1      1      1
0

If you are not worried about the type column you can do something like this

library(tidyverse)

"  ID  type   genes1  genes2   genes3
  1   new            2        NA      NA
  1   old            NA       0       NA
  1   suggested      NA       NA      2
  2   new            1        NA      NA
  2   old            NA       1       NA
  2   suggested      NA       NA      1" %>% 
  read_table() -> df

df %>% 
  pivot_longer(-c(ID, type)) %>% 
  drop_na(value) %>% 
  select(-type) %>% 
  pivot_wider(names_from = name, values_from = value)
# A tibble: 2 × 4
     ID genes1 genes2 genes3
  <dbl>  <dbl>  <dbl>  <dbl>
1     1      2      0      2
2     2      1      1      1
0

If you want to keep the "type of data" column while using summarise, you can use the following code:

df <- read.table(text = "ID  type_of_data   genes1  genes2   genes3  
  1   new            2        NA      NA
  1   old            NA       0       NA
  1   suggested      NA       NA      2
  2   new            1        NA      NA
  2   old            NA       1       NA
  2   suggested      NA       NA      1", header = TRUE)  

library(dplyr)
library(tidyr)
df1 <- df %>% 
  group_by(ID) %>% 
  summarise(across(starts_with("genes"), na.omit),
            type_of_data = type_of_data[genes1]) %>%
  ungroup() 
df1
#> # A tibble: 2 × 5
#>      ID genes1 genes2 genes3 type_of_data
#>   <int>  <int>  <int>  <int> <chr>       
#> 1     1      2      0      2 old         
#> 2     2      1      1      1 new

Created on 2022-07-26 by the reprex package (v2.0.1)

Quinten
  • 35,235
  • 5
  • 20
  • 53
0

An alternative approach based on the coalesce() function from tidyr

In the below code, we remove the type variable since the OP indicated we don't need it in the output. We then group_by() to essentially break up our data into separate data.frames for each ID. The coalesce_by_column() function we define then converts each of these into a list whose elements are each a vector of values for each gene column.

We finally can pass this list to coalesce(). coalesce() takes a set of vectors and finds the first non-NA value across the vectors for each index of the vectors. In practice, this means it can take multiple columns with only one or zero non-NA value across all columns for each index and collapse them into a single column with as many non-NA values as possible.

Usually we would have to pass each vector as its own object to coalesce() but we can use the (splice operator)[https://stackoverflow.com/questions/61180201/triple-exclamation-marks-on-r] !!! to pass each element of our list as its own vector. See the last example in ?"!!!" for a demonstration.

library(dplyr)
library(tidyr)

# Define a function to coalesce by column
coalesce_by_column <- function(df) {
  coalesce(!!! as.list(df))
}

# Remove NA rows
df %>% 
  select(-type) %>%
  group_by(ID) %>%
  summarise(across(.fns = coalesce_by_column))
#> # A tibble: 2 x 4
#>      ID genes1 genes2 genes3
#>   <dbl>  <dbl>  <dbl>  <dbl>
#> 1     1      2      0      2
#> 2     2      1      1      1
socialscientist
  • 3,759
  • 5
  • 23
  • 58