Is there a way to summarise by percentage in R while including the data in a new data frame?

Question

I work a lot with Excel and R in my job, and I've been trying to automate a type of report on data quality that my boss asks me for. I only recently started working with R, so my code isn't the best.

The idea is to build a data.frame that summarizes the following for each column: the total number of NAs, the percentage of NAs, and then, after filtering on some columns, the number of NAs within a given level.

The code I've tried is the following one:

rowsna <- c("Total NA", "% NA", "n NA Variable 1, level 1",...)
na_count <- df %>% summarise_all(~sum(is.na(.)))
na_count[2, ] <- df %>% summarise_all(~mean(is.na(.)))
na_count[3, ] <- df %>% filter(variable == value) %>% summarise_all(~sum(is.na(.)))
...
row.names(na_count) <- rowsna
na_count <- as.data.frame(t(na_count))
na_count$variable

The thing is, I have no idea how to calculate the percentage of missing values in the na_count[2, ] part. I would appreciate some help.

You will probably attract more answers if you can supply some data along with expected output. You could use `tribble()` from the `tidyverse` to supply hand-entered data (see [here](https://tibble.tidyverse.org/reference/tribble.html)). — Ian Gow, Jun 01 '23 at 12:46
@Ben Your answer gave me this problem: "Can't convert from double to integer due to loss of precision". As I understand it, R isn't recognizing that I want to calculate the number of NAs in the data / length of the data... — Jorge A, Jun 02 '23 at 06:44
@JorgeA As mentioned above, you need to provide some sample data along with expected output to get an appropriate answer. Please review [this](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) to learn how to make a good reproducible example. — Ben, Jun 02 '23 at 12:32
Thank you both. I'm working on that reproducible example; when I simulate data with NAs, it works without issue. It must be something in my dataset. I'll post an update when I know more. — Jorge A, Jun 02 '23 at 14:45
'Scoped verbs (_if, _at, _all) have been superseded by the use of pick() or across() in an existing verb. See vignette("colwise") for details.' — Mark, Jul 19 '23 at 10:39

score 1 · Accepted Answer · answered Jul 19 '23 at 10:38

It sounds like this is what you want:

library(tidyverse)

# toy dataset
df <- tibble(
  id = 1:10,
  x = c(1:9, NA),
  y = c(1:5, rep(NA, 5)),
  z = rep(NA, 10)
)

NA_df <- df %>%
  # we find the number of NAs in each column
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%

  # then we pivot it longer
  pivot_longer(cols = everything()) %>%

  # then find the percentage of NAs in each column
  mutate(pct_na = 100 * value / nrow(df))

# let's say for the sake of argument that we only want to get columns with less than 5 NAs
threshold <- 5

good_columns <- NA_df %>%
  filter(value < threshold) %>%
  pull(name)

# now we can use the good_columns vector to subset the original dataframe
df %>%
  select(all_of(good_columns))

# A tibble: 10 × 2
      id     x
   <int> <int>
 1     1     1
 2     2     2
 3     3     3
 4     4     4
 5     5     5
 6     6     6
 7     7     7
 8     8     8
 9     9     9
10    10    NA
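
The question also asks for the NA count within one level of a variable, alongside the totals and percentages. The following is a minimal sketch of one way to collect all three in a single data frame, not a definitive solution. The `group` column, its level "a", and the `na_stat()` helper are hypothetical names introduced only for illustration; they stand in for the asker's `variable == value` filter. Every statistic is coerced to double, so counts and percentages share one type. This may also sidestep the "loss of precision" error mentioned in the comments, which can occur when doubles are assigned into integer columns of a tibble.

```r
library(dplyr)
library(tidyr)

# toy data; `group` and its levels are hypothetical stand-ins
# for the asker's `variable == value` filter
df <- tibble(
  id    = 1:10,
  group = rep(c("a", "b"), times = 5),
  x     = c(1:9, NA),
  y     = c(1:5, rep(NA, 5))
)

# hypothetical helper: applies `fn` to is.na() of every column and
# returns one row per column, with the result stored under `label`
na_stat <- function(data, fn, label) {
  data %>%
    # as.double() keeps counts and percentages in the same type
    summarise(across(everything(), ~ as.double(fn(is.na(.x))))) %>%
    pivot_longer(everything(), names_to = "column", values_to = label)
}

na_summary <- na_stat(df, sum, "total_na") %>%
  # percentage of NAs per column
  left_join(na_stat(df, function(v) 100 * mean(v), "pct_na"),
            by = "column") %>%
  # NA count per column within the rows where group == "a"
  left_join(na_stat(filter(df, group == "a"), sum, "na_group_a"),
            by = "column")

na_summary
```

`na_summary` has one row per column of `df` and one column per statistic, which is the transposed layout the question builds with `t()`. More levels can be added with further `left_join()` calls on `column`.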
