I work a lot with Excel and R in my job, and I've been trying to automate a data-quality report that my boss asks me for. I've only recently started working with R, so my code isn't the best.

The idea is to build a data.frame that summarizes, for each column of the data, the following: the total number of NAs, the percentage of NAs, and then, after filtering by some columns, the number of NAs within a given level.

The code I've tried is the following one:

rowsna <- c("Total NA", "% NA", "n NA Variable 1, level 1",...)
na_count <- df %>% summarise_all(~sum(is.na(.)))
na_count[2, ] <- df %>% summarise_all(~mean(is.na(.)))
na_count[3, ] <- df %>% filter(variable == value) %>% summarise_all(~sum(is.na(.)))
...
row.names(na_count) <- rowsna
na_count <- as.data.frame(t(na_count))
na_count$variable

The thing is, I have no idea how to calculate the percentage of missing values in the na_count[2, ] part. I would appreciate some help if possible.

Jorge A
  • You will probably attract more answers if you can supply some data along with expected output. You could use `tribble()` from the `tidyverse` to supply hand-entered data (see [here](https://tibble.tidyverse.org/reference/tribble.html)). – Ian Gow Jun 01 '23 at 12:46
  • @Ben Your answer gave me this problem: "Can't convert from double to integer due to loss of precision". As I understand it, it isn't picking up that I want to calculate the number of NAs in the data / length of the data... – Jorge A Jun 02 '23 at 06:44
  • @JorgeA As mentioned above, you need to provide some sample data along with expected output to get an appropriate answer. Please review [this](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) to learn how to make a good reproducible example. – Ben Jun 02 '23 at 12:32
  • Thank you both. I'm working on that reproducible example; when I simulate data with NAs, it works without issue. It must be something in my dataset. I'll post an update when I know more. – Jorge A Jun 02 '23 at 14:45
  • 'Scoped verbs (_if, _at, _all) have been superseded by the use of pick() or across() in an existing verb. See vignette("colwise") for details.' – Mark Jul 19 '23 at 10:39

1 Answer

It sounds like this is what you want:

library(tidyverse)

# toy dataset
df <- tibble(
  id = 1:10,
  x = c(1:9, NA),
  y = c(1:5, rep(NA, 5)),
  z = rep(NA, 10)
)

NA_df <- df %>%
  # we find the number of NAs in each column
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%

  # then we pivot it longer
  pivot_longer(cols = everything()) %>%

  # then find the percentage of NAs in each column
  mutate(pct_na = 100 * value / nrow(df))

# let's say for the sake of argument that we only want to keep columns with fewer than 5 NAs
threshold <- 5

good_columns <- NA_df %>%
  filter(value < threshold) %>%
  pull(name)

# now we can use the good_columns vector to subset the original dataframe
df %>%
  select(all_of(good_columns))

# A tibble: 10 × 2
      id     x
   <int> <int>
 1     1     1
 2     2     2
 3     3     3
 4     4     4
 5     5     5
 6     6     6
 7     7     7
 8     8     8
 9     9     9
10    10    NA
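
To get closer to the table described in the question (total NAs, % NA, and NAs within one level of a variable), one option is to compute each summary separately, pivot it longer, and join the pieces by column name. This is only a sketch: the `group` column, the level `"a"`, and the output column names are made up for illustration, since the question doesn't include data. Keeping each summary in its own column also sidesteps the "loss of precision" error from the comments, which is probably (assuming `df` is a tibble) caused by assigning a row of doubles into columns that already hold integer counts.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: `group` stands in for the variable you filter on
df <- tibble(
  group = c("a", "a", "b", "b", "b"),
  x     = c(1, NA, 3, NA, NA),
  y     = c(NA, 2, 3, 4, 5)
)

# Helper (hypothetical name): count NAs per column and return one row per
# column of `data`, with the counts stored under `value_name`
count_na_long <- function(data, value_name) {
  data %>%
    summarise(across(everything(), ~ sum(is.na(.x)))) %>%
    pivot_longer(everything(), names_to = "column", values_to = value_name)
}

# Total number of NAs in each column
total_na <- count_na_long(df, "total_na")

# Percentage of NAs in each column (mean of a logical = proportion of TRUE)
pct_na <- df %>%
  summarise(across(everything(), ~ 100 * mean(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "column", values_to = "pct_na")

# Number of NAs in each column, restricted to rows where group == "a"
na_in_a <- df %>%
  filter(group == "a") %>%
  count_na_long("na_group_a")

# One row per original column, one column per summary
na_summary <- total_na %>%
  left_join(pct_na, by = "column") %>%
  left_join(na_in_a, by = "column")

na_summary
```

For this toy data, `x` has 3 NAs (60%) and `y` has 1 NA (20%), with one NA each among the rows where `group` is `"a"`. Further levels can be added the same way, with one more `filter()` plus `left_join()` per level.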
Mark