Using Tidyverse pipeline to count NAs and reorder

Question

I want to create a vector of the count of NAs for each column in a data set and then reorder it so that the columns with the most missing values come first, in decreasing order. I've done the following, which works:

na_vector <- household_data %>% summarise_all(list(~(sum(is.na(.))))) 
na_vector <- as.vector(na_vector)
sort(na_vector, decreasing = T)

But there must be a way to do this all within the tidyverse pipeline, right? How would I do this?

You should post a reproducible example to get a real answer, but something like lapply(household_data, function(x) sum(is.na(x))) would be a good start — Bill O'Brien, Jul 31 '20 at 20:52
It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. — MrFlick, Jul 31 '20 at 20:57

Darren Tsai · Accepted Answer · 2020-07-31T21:42:56.597

Example Data

set.seed(123)
mat <- matrix(round(rnorm(50), 2), 10, 5)
mat[sample(1:50, 20)] <- NA
df <- data.frame(mat)

#       X1    X2    X3    X4    X5
# 1  -0.56  1.22 -1.07    NA    NA
# 2  -0.23    NA    NA    NA -0.21
# 3     NA  0.40    NA  0.90    NA
# 4   0.07    NA -0.73  0.88    NA
# 5   0.13 -0.56    NA    NA  1.21
# 6   1.72  1.79 -1.69  0.69    NA
# 7     NA  0.50  0.84  0.55    NA
# 8  -1.27 -1.97  0.15 -0.06    NA
# 9  -0.69  0.70    NA -0.31  0.78
# 10 -0.45 -0.47    NA    NA    NA

1. base solution

sort(colSums(is.na(df)), decreasing = TRUE)

# X5 X3 X4 X1 X2 
#  7  5  4  2  2

2. dplyr pipes

library(dplyr)

df %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  unlist() %>%
  sort(decreasing = TRUE)

# X5 X3 X4 X1 X2 
#  7  5  4  2  2

3. A more complex way, but with more tidyverse logic

df %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  tidyr::pivot_longer(everything()) %>%
  arrange(desc(value)) %>%
  # deframe() lives in tibble, which library(dplyr) does not attach
  tibble::deframe()

# X5 X3 X4 X1 X2 
#  7  5  4  2  2
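The long format produced by the pivot_longer() step can be useful in its own right. As a rough sketch, stopping before deframe() leaves a two-column tibble that can carry extra information, such as the share of missing values per column. The toy data frame `toy` and the names `column`, `n_missing`, and `prop_missing` are assumptions made up for this illustration, not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, small enough to count the NAs by hand
toy <- data.frame(
  a = c(1, NA, 3, NA),
  b = c(NA, NA, NA, 4),
  c = c(1, 2, 3, 4)
)

na_table <- toy %>%
  # one row holding the NA count of every column
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  # reshape to one row per column: name in `column`, count in `n_missing`
  pivot_longer(everything(), names_to = "column", values_to = "n_missing") %>%
  # share of rows that are missing in each column
  mutate(prop_missing = n_missing / nrow(toy)) %>%
  # columns with the most NAs first
  arrange(desc(n_missing))

na_table
```

Here `b` has three NAs, `a` has two, and `c` has none, so the rows come out in that order. Piping `na_table` through `select(column, n_missing)` and `tibble::deframe()` would turn it back into a named vector if that is what is needed.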

score 0 · Answer 2 · answered Jul 31 '20 at 21:40

I created a sample dataset to play around with your question. Here is the dataset I am using:

library(tidyverse)

options <- c("Yes", "No", NA_character_)

# create the first row of the df that we will be recreating
df <- tibble(
  ID = 1,
  neckpain = "Yes",
  backpain = NA_character_,
  kneepain = NA_character_,
)

# create a function that will help build the entire reproducible df
add.option.sample.row.f <- function( df, n ){
  # FUNCTION add.option.sample.row.f
  # args: df as tibble
  #       n  as integer
  # takes df and appends n rows with randomly sampled options
  # returns a tibble with nrow(df) + n rows and 4 columns:
  # ID (unique), neckpain (character),
  # backpain (character), kneepain (character)
  # - - - - - - - - - - - - - - - - - - - - - -
  for( i in seq_len(n) ){  # seq_len() avoids the 1:0 trap when n is 0
    df <- df %>% add_row(
      ID = nrow(df)+1,
      neckpain = sample(options)[1],
      backpain = sample(options)[1],
      kneepain = sample(options)[1]
    )
  }
  return(df)
}

# build sample df
df <- add.option.sample.row.f(df, 500)

head(df)
# A tibble: 6 x 4
# ID neckpain backpain kneepain
# <dbl> <chr>    <chr>    <chr>   
# 1     1 Yes      NA       NA      
# 2     2 Yes      NA       Yes     
# 3     3 No       NA       Yes     
# 4     4 NA       NA       NA      
# 5     5 NA       No       NA      
# 6     6 NA       Yes      Yes

With this data set, let's approach what you are looking to do. First, let's store the columns in question as a vector:

columns.to.reorder <- c(
  "neckpain",
  "backpain",
  "kneepain"
)

Use mutate to compute the cumulative sum (cumsum) of the NAs in each column.

df %>%
  mutate(
    !!paste0("NA_", columns.to.reorder[1]) := cumsum(is.na(.[[columns.to.reorder[1]]])+0),
    !!paste0("NA_", columns.to.reorder[2]) := cumsum(is.na(.[[columns.to.reorder[2]]])+0),
    !!paste0("NA_", columns.to.reorder[3]) := cumsum(is.na(.[[columns.to.reorder[3]]])+0)
  )

Or use the more elegant across() function from newer versions of dplyr:

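df %>%
  # all_of() is the supported way to select with an external character vector
  mutate(across(.cols = all_of(columns.to.reorder),
                .fns = function(x) cumsum(is.na(x)),
                # name the new columns NA_neckpain, NA_backpain, NA_kneepain
                .names = "NA_{.col}")
  )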

This makes it easier to find the maximum of each column's NA count: the cumsum ticks up by one at each additional NA, so the last value in each new column is that column's total. I do not know how you'd like to split the vectors out, since sorting one vector would reorder the others. Please advise the direction you are going with this.
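One possible way to continue from the cumulative sums, sketched under assumptions: because the running count never decreases, its maximum per column equals that column's total number of NAs, and those totals can then be sorted. The small tibble `pain` and the result name `na_counts` are hypothetical and stand in for the sample data above.

```r
library(dplyr)

# Hypothetical toy data with the same column names as the sample df
pain <- tibble(
  ID = 1:5,
  neckpain = c("Yes", NA, "No", NA, NA),
  backpain = c(NA, "No", "Yes", "Yes", "No"),
  kneepain = c(NA, NA, "Yes", "No", "Yes")
)

columns.to.reorder <- c("neckpain", "backpain", "kneepain")

na_counts <- pain %>%
  # running count of NAs down each selected column
  mutate(across(all_of(columns.to.reorder), ~ cumsum(is.na(.x)))) %>%
  # the maximum of a running count is the column's total NA count
  summarise(across(all_of(columns.to.reorder), max)) %>%
  # flatten the one-row tibble into a named vector and sort it
  unlist() %>%
  sort(decreasing = TRUE)

na_counts
```

With this toy data, neckpain has three NAs, kneepain two, and backpain one, so the vector comes out in that order. If only the totals matter, `sum(is.na(.x))` inside `summarise()` gets there without the intermediate cumsum step.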
