
I couldn't find what I was looking for anywhere else, so I hope I'm not asking something that is already solved. Sorry if I am.

I want to loop through each column individually for multiple dataframes and apply a function to check the data quality.

I want to find:

  • number of missing values
  • percentage of missing values
  • number of empty rows
  • percentage of empty rows
  • number of distinct values
  • percent of distinct values
  • number of duplicates
  • percentage of duplicates
  • one example of a value in a row that is not empty "" and not missing
  • (and any other information you suggest could tell me something about the data quality)

I then want to save the information in a dataframe that I can easily download, looking something like this:

table_name | column_name | # missing values | % missing values | # empty rows | etc...

Can this be done?

I have named my dataframes "a", "b" and "c" (there are actually 80, but I have simplified here) and stored them in a list called "table_list". The dataframes vary in their number of variables/columns.

I have made this function:

analyze <- function(i) {
  data <- table_list[i]
  # Find number of missing values
  number_missing_values <- sum(is.na(data))
  # Find percentage of missing values
  percentage_missing_values <- sum(is.na(data)) / nrow(data)
  # Find number of empty rows
  number_missing_values <- sum(data == "", na.rm = TRUE)
  # Find percentage of empty rows
  percentage_empty_rows <- sum(data == "", na.rm = TRUE) / nrow(data)
  # Find number of distinct values
  number_distinct_values <- count(data %>% distinct())
  # Find percent of distinct values
  percentage_distinct_values <- count(data %>% distinct())/nrow(data)

This function lacks (not sure how to do it):

  • number of duplicates
  • percentage of duplicates
  • one example of a value in a row that is not empty "" and not missing
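For these three items, a minimal sketch on a single column could look like the following. It uses only base R; `column_extras` is a hypothetical helper name, and treating "duplicates" as every repeat of an earlier value (which is how `duplicated()` counts them, NA repeats included) is an assumption about what is wanted.

```r
# Hypothetical helper: duplicate counts and one example value
# for a single column (vector) x.
column_extras <- function(x) {
  n <- length(x)
  # duplicated() flags each repeat of an earlier value, so the first
  # occurrence of every value is not counted as a duplicate.
  n_dup <- sum(duplicated(x))
  # Keep only values that are neither NA nor the empty string.
  usable <- x[!is.na(x) & as.character(x) != ""]
  data.frame(
    number_duplicates = n_dup,
    percentage_duplicates = if (n > 0) 100 * n_dup / n else NA_real_,
    example_value = if (length(usable) > 0) as.character(usable[1]) else NA_character_,
    stringsAsFactors = FALSE
  )
}

# Made-up example column
column_extras(c("x", "", NA, "x", "y"))
```

The example value is simply the first usable entry; sampling a random one with `sample()` would work as well if a more varied picture is preferred.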

I was planning to apply this function in this for-loop:

for (i in table_list) {
  analyze(i)
}

I'm also not sure how to turn the result into a dataframe like the one I illustrated above, with the different column names.

What am I getting wrong here, and what should I do differently?

Jonaash
  • See https://stackoverflow.com/a/24376207/3358227 for "list of frames" operations. Your first issue is that `for` doesn't return anything: you run `analyze(i)` and immediately discard its output. While we don't see all of your function (it is incomplete), it is not working on the data *in-place*, meaning that any changes it makes are temporary and do not affect the original data in the original (calling) environment. – r2evans Nov 05 '20 at 16:47
  • (1) number of duplicates, `sum(duplicated(...))`; (2) pct of dupes, `sum(duplicated(...))/nrow(...)`. – r2evans Nov 05 '20 at 17:38
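
Putting the comments together, one possible shape for the whole thing is sketched below. It is only a sketch under assumptions: `table_list` is taken to be a *named* list (the toy data here is made up), `analyze_column` and `analyze_table` are hypothetical names, percentages are on a 0-100 scale, and only base R is used. The key points are `[[ ]]` to extract a data frame from the list, and `lapply()` plus `do.call(rbind, ...)` to collect results that a bare `for` loop would throw away.

```r
# Toy stand-ins for the real tables; the contents are made up.
table_list <- list(
  a = data.frame(id = c(1, 2, NA), name = c("x", "", "x"),
                 stringsAsFactors = FALSE),
  b = data.frame(code = c("p", "p", NA, "q"), stringsAsFactors = FALSE)
)

# Summarise one column; returns a one-row data frame.
analyze_column <- function(x, table_name, column_name) {
  n <- length(x)
  n_missing <- sum(is.na(x))
  n_empty <- sum(as.character(x) == "", na.rm = TRUE)
  # Distinct non-missing values (the empty string counts as a value here).
  n_distinct <- length(unique(x[!is.na(x)]))
  data.frame(
    table_name = table_name,
    column_name = column_name,
    number_missing_values = n_missing,
    percentage_missing_values = 100 * n_missing / n,
    number_empty_rows = n_empty,
    percentage_empty_rows = 100 * n_empty / n,
    number_distinct_values = n_distinct,
    percentage_distinct_values = 100 * n_distinct / n,
    stringsAsFactors = FALSE
  )
}

# Summarise every column of one table, looked up by name.
analyze_table <- function(table_name) {
  data <- table_list[[table_name]]  # [[ ]] extracts the data frame itself
  rows <- lapply(names(data), function(col) {
    analyze_column(data[[col]], table_name, col)
  })
  do.call(rbind, rows)
}

# lapply() keeps each result; rbind stacks them into one data frame.
quality <- do.call(rbind, lapply(names(table_list), analyze_table))
rownames(quality) <- NULL
print(quality)

# To save the result for download (the file name is arbitrary):
# write.csv(quality, "data_quality.csv", row.names = FALSE)
```

Because each column yields one row with identical column names, tables with different numbers of variables stack without any special handling.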
