I couldn't find what I was looking for anywhere else, so I hope I'm not asking something that is already solved. Sorry if I am.
I want to loop through each column individually for multiple dataframes and apply a function to check the data quality.
I want to find:
- number of missing values
- percentage of missing values
- number of empty rows
- percentage of empty rows
- number of distinct values
- percentage of distinct values
- number of duplicates
- percentage of duplicates
- one example of a value in a row that is not empty "" and not missing
- (and any other information you suggest could tell me something about the data quality)
I then want to save the information in a dataframe that I can easily download, looking something like this:
table_name | column_name | # missing values | % missing values | # empty rows | etc...
Can this be done?
I have named my dataframes "a", "b" and "c" (there are actually 80, but I'm showing three to keep things simple) and stored them in a list called "table_list". The dataframes vary in the number of variables/columns.
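A minimal reproducible stand-in for this setup might look like the sketch below. The column names and values are invented purely for illustration; only the names a, b, c and table_list come from the description above, and using a named list is an assumption that makes the table name available later.

```r
# Toy stand-ins for the real tables; all column names and values are invented.
a <- data.frame(id = c(1, 2, 2, NA),
                name = c("x", "", "x", NA),
                stringsAsFactors = FALSE)
b <- data.frame(code = c("p", "q", "q"),
                stringsAsFactors = FALSE)
c <- data.frame(value = c(NA, 3.5),
                flag = c("", "y"),
                note = c("", ""),
                stringsAsFactors = FALSE)

# Assumption: a named list, so each table's name travels with its data.
table_list <- list(a = a, b = b, c = c)
```

The tables deliberately differ in their number of columns, matching the situation described.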
I have made this function:
library(dplyr)  # provides %>%, distinct() and count()

analyze <- function(i) {
  # Use [[i]] to extract the dataframe itself; [i] would return a one-element list
  data <- table_list[[i]]
  # Find number of missing values
  number_missing_values <- sum(is.na(data))
  # Find percentage of missing values
  percentage_missing_values <- sum(is.na(data)) / nrow(data)
  # Find number of empty rows
  number_empty_rows <- sum(data == "", na.rm = TRUE)
  # Find percentage of empty rows
  percentage_empty_rows <- sum(data == "", na.rm = TRUE) / nrow(data)
  # Find number of distinct values
  number_distinct_values <- count(data %>% distinct())
  # Find percentage of distinct values
  percentage_distinct_values <- count(data %>% distinct()) / nrow(data)
}
This function lacks (not sure how to do it):
- number of duplicates
- percentage of duplicates
- one example of a value in a row that is not empty "" and not missing
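The missing pieces listed above can be sketched for a single column using base R. This is only an illustration under assumptions: the vector x is a hypothetical column with invented values, and "duplicate" is taken to mean any repeat of an earlier value (with NA and "" counted like any other value).

```r
# Hypothetical column, invented for illustration.
x <- c("x", "", "x", NA, "y")

# duplicated() flags each repeat of an earlier value.
number_duplicates <- sum(duplicated(x))
percentage_duplicates <- 100 * number_duplicates / length(x)

# Keep values that are neither NA nor the empty string,
# then take the first one (NA if nothing qualifies).
non_empty <- x[!is.na(x) & x != ""]
example_value <- if (length(non_empty) > 0) non_empty[1] else NA
```

Whether NA and "" should count towards duplicates is a design choice; filtering them out before calling duplicated() would give the stricter reading.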
I was planning to apply this function in this for-loop:
for (i in seq_along(table_list)) {
  # Loop over positions, because analyze() indexes into table_list
  analyze(i)
}
I'm also not sure how to turn the result into a dataframe like the one I illustrated above, with the different column names.
What am I getting wrong here, and what should I do differently?
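One possible shape for the desired result, as a rough sketch rather than a definitive solution: compute the statistics per column, return each as a one-row dataframe, and stack the rows. The helper names analyze_column and analyze_tables, the output column names, and the toy tables are all hypothetical. The sketch uses base R only, reads "empty rows" as cells equal to "", and counts distinct values excluding NA; both readings are assumptions.

```r
# Summarise a single column as a one-row data frame (hypothetical helper).
analyze_column <- function(x, table_name, column_name) {
  n <- length(x)
  n_missing <- sum(is.na(x))
  n_empty <- sum(x == "", na.rm = TRUE)
  n_distinct <- length(unique(x[!is.na(x)]))  # assumption: NA is not a distinct value
  n_duplicates <- sum(duplicated(x))
  non_empty <- x[!is.na(x) & x != ""]
  data.frame(
    table_name = table_name,
    column_name = column_name,
    n_missing = n_missing,
    pct_missing = 100 * n_missing / n,
    n_empty = n_empty,
    pct_empty = 100 * n_empty / n,
    n_distinct = n_distinct,
    pct_distinct = 100 * n_distinct / n,
    n_duplicates = n_duplicates,
    pct_duplicates = 100 * n_duplicates / n,
    example_value = if (length(non_empty) > 0) as.character(non_empty[1]) else NA_character_,
    stringsAsFactors = FALSE
  )
}

# Apply the helper to every column of every table in a named list.
analyze_tables <- function(tables) {
  rows <- list()
  for (table_name in names(tables)) {
    data <- tables[[table_name]]
    for (column_name in names(data)) {
      rows[[length(rows) + 1]] <-
        analyze_column(data[[column_name]], table_name, column_name)
    }
  }
  do.call(rbind, rows)
}

# Toy input with invented values, standing in for the real tables.
table_list <- list(
  a = data.frame(id = c(1, 2, 2, NA), name = c("x", "", "x", NA),
                 stringsAsFactors = FALSE),
  b = data.frame(code = c("p", "q", "q"), stringsAsFactors = FALSE)
)

quality <- analyze_tables(table_list)
# To download the result: write.csv(quality, "data_quality.csv", row.names = FALSE)
```

Here quality has one row per table/column pair, so tables with different numbers of columns are handled without any special casing.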