How can I loop through multiple columns in multiple dataframes in R?

Question

I couldn't find what I was looking for anywhere else, so I hope I'm not asking something that is already solved. Sorry if I am.

I want to loop through each column individually for multiple dataframes and apply a function to check the data quality.

I want to find:

number of missing values
percentage of missing values
number of empty rows
percentage of empty rows
number of distinct values
percent of distinct values
number of duplicates
percentage of duplicates
one example of a value in a row that is not empty "" and not missing
(and any other information you suggest could tell me something about the data quality)

I then want to save the information in a dataframe that I can easily download, looking something like this:

Can this be done?

I have named my dataframes "a", "b" and "c" (there are actually 80; I'm showing three to keep things simple) and stored them in a list called "table_list". The dataframes vary in their number of variables/columns.

I have made this function:

analyze <- function(i) {
  data <- table_list[i]
  # Find number of missing values
  number_missing_values <- sum(is.na(data))
  # Find percentage of missing values
  percentage_missing_values <- sum(is.na(data)) / nrow(data)
  # Find number of empty rows
  number_missing_values <- sum(data == "", na.rm = TRUE)
  # Find percentage of empty rows
  percentage_empty_rows <- sum(data == "", na.rm = TRUE) / nrow(data)
  # Find number of distinct values
  number_distinct_values <- count(data %>% distinct())
  # Find percent of distinct values
  percentage_distinct_values <- count(data %>% distinct())/nrow(data)

This function lacks (not sure how to do it):

number of duplicates
percentage of duplicates
one example of a value in a row that is not empty "" and not missing
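
For these three missing items, one possible sketch for a single column (a vector) follows. It relies on base R's `duplicated()`; the helper name `column_extras` and the sample vector are made up for illustration. It assumes that a "duplicate" is any repeat of an earlier value (repeated `NA`s included) and that percentages are wanted on a 0 to 100 scale.

```r
# Hypothetical helper: extra quality metrics for one column (a vector)
column_extras <- function(x) {
  n <- length(x)
  # duplicated() flags every repeat of an earlier value (NA included)
  n_dup <- sum(duplicated(x))
  # Keep only values that are neither NA nor the empty string
  filled <- x[!is.na(x) & as.character(x) != ""]
  list(
    number_duplicates = n_dup,
    percentage_duplicates = if (n > 0) 100 * n_dup / n else NA_real_,
    example_value = if (length(filled) > 0) as.character(filled[1]) else NA_character_
  )
}

# Made-up sample column
x <- c("", NA, "apple", "apple", "pear")
res <- column_extras(x)
str(res)
```

If a duplicate should instead mean every member of a repeated group (not just the later copies), the count would need a different definition, for example combining `duplicated(x)` with `duplicated(x, fromLast = TRUE)`.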

I was planning to apply this function in this for-loop:

for (i in table_list) {
  analyze(i)
}
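
One thing worth checking in the loop above: `for (i in table_list)` passes each data frame itself to `analyze()` rather than an index, and `table_list[i]` with single brackets returns a one-element list, not a data frame. The sketch below illustrates the difference with made-up data frames `a` and `b`, and shows one way to store what a loop computes.

```r
# Made-up stand-ins for the real data frames
a <- data.frame(x = c(1, NA, 3), y = c("u", "", "w"))
b <- data.frame(z = c(TRUE, FALSE))
table_list <- list(a = a, b = b)

class(table_list[1])     # single bracket: still a list
class(table_list[[1]])   # double bracket: the data frame itself
nrow(table_list[1])      # NULL, because a plain list has no dim attribute

# Iterate over indices and keep what each iteration computes
row_counts <- vector("list", length(table_list))
for (i in seq_along(table_list)) {
  row_counts[[i]] <- nrow(table_list[[i]])
}
names(row_counts) <- names(table_list)
```

The same pattern applies to any function called inside the loop: its return value has to be assigned somewhere, otherwise it is discarded.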

I'm also not sure how to turn the result into a dataframe like the one I illustrated above, with the different column names.

What am I getting wrong here, and what should I do differently?

See https://stackoverflow.com/a/24376207/3358227 for "list of frames" operations. Your first issue is that `for` doesn't return anything, and you run `analyze(i)` and immediately ignore/discard its output. While we don't see all of your function (it is incomplete), it is not working on the data *in-place*, meaning that the changes it makes are temporary only, not on the original data in the original (calling) environment. — r2evans, Nov 05 '20 at 16:47
(1) number of duplicates, `sum(duplicated(...))`; (2) pct of dupes, `sum(duplicated(...))/nrow(...)`. — r2evans, Nov 05 '20 at 17:38
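
Following the comments, a minimal sketch of the "list of frames" approach might look like the code below. It is not the asker's function: the helper names `summarise_column` and `summarise_table`, the sample data, and the output file name are all hypothetical. It assumes "empty" means a value equal to `""`, counts `NA` as one distinct value, and uses `sum(duplicated(...))` for duplicates as suggested above.

```r
# Hypothetical helper: one row of counts for a single column
summarise_column <- function(x, table_name, column_name) {
  data.frame(
    table = table_name,
    column = column_name,
    n_rows = length(x),
    number_missing_values = sum(is.na(x)),
    number_empty_rows = sum(as.character(x) == "", na.rm = TRUE),
    number_distinct_values = length(unique(x)),  # NA counts as one value here
    number_duplicates = sum(duplicated(x)),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: apply summarise_column() to every column of one data frame
summarise_table <- function(df, table_name) {
  rows <- lapply(names(df), function(col) summarise_column(df[[col]], table_name, col))
  do.call(rbind, rows)
}

# Made-up stand-ins for the real list of 80 data frames
table_list <- list(
  a = data.frame(id = c(1, 2, 2, NA), name = c("x", "", "x", NA)),
  b = data.frame(flag = c("yes", "no", "yes"))
)

# Map() walks the data frames and their names in parallel; rbind stacks the rows
per_table <- Map(summarise_table, table_list, names(table_list))
quality_report <- do.call(rbind, unname(per_table))
rownames(quality_report) <- NULL

# Percentages (0 to 100) can be derived from the counts in one vectorised step
quality_report$percentage_missing_values <-
  100 * quality_report$number_missing_values / quality_report$n_rows
quality_report$percentage_duplicates <-
  100 * quality_report$number_duplicates / quality_report$n_rows

# Uncomment to save the report; the file name is only an example
# write.csv(quality_report, "quality_report.csv", row.names = FALSE)
```

Because `lapply()` and `Map()` return their results, nothing is discarded the way it is with a bare `for` loop, and the final data frame has one row per column of each input table.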
