I'm attempting to build a function that will give me a summary about the properties of variables in my data. For example, I want to know whether certain variables have digits in their values, or specific characters, or otherwise number of unique values. While this is a simple task when testing a specific variable for a specific property, I don't know how to do it en mass. That is, I want to define different types of tests and pass a vector with column names to to test on each of the tests. I want to build a function that will be scalable/flexible so that I could add or remove tests as necessary.
Example
Let's say that I have some data about people, including age, weight, and sex. Each record is also assigned with a special id string. Although the variables vary in type (i.e., age is numeric but sex is character), I end up having the entire data coerced to characters.
library(tibble)
library(dplyr)
# a function to generate random alphanumeric strings (https://stackoverflow.com/a/42734863/6105259)
sample_strings <- function(n = 5000) {
a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}
set.seed(2020)
desired_n <- 20
df <-
tibble(special_id = sample_strings(desired_n),
age = sample(c(10:80), size = desired_n),
weight = sample(seq(45, 100, by = 0.001), size = desired_n, replace = TRUE),
monthly_income = sample(c(2000:15000), size = desired_n),
sex = sample(c("male", "female"), size = desired_n, replace = TRUE)) %>%
mutate_all(as.character)
## # A tibble: 20 x 5
## special_id age weight monthly_income sex
## <chr> <chr> <chr> <chr> <chr>
## 1 LBRKP2319H 20 87.554 4699 male
## 2 WXJZW1202W 60 95.617 3163 female
## 3 VHFFD6488D 22 56.573 3468 female
## 4 XDOBT5680G 19 69.092 13418 female
## 5 ARKYR3604N 64 84.654 7865 female
## 6 QNMCJ9652Q 68 85.483 11704 male
## 7 DNXXA5261W 56 69.472 6075 female
## 8 JLMRO7068U 12 47.802 12712 male
## 9 FPMAV9425D 34 85.468 10120 female
## 10 QOBVI1118X 53 58.419 6984 male
## 11 MNPGP4717P 40 52.546 7040 female
## 12 XGMHF7783I 32 51.613 8443 male
## 13 HTAUX2428V 79 74.555 8339 male
## 14 JRFXW0040B 54 69.734 2394 male
## 15 RTCMZ4379Q 57 87.162 13986 male
## 16 PUYBJ1885Q 13 45.77 8770 female
## 17 VTMEZ2421E 30 51.458 3576 male
## 18 BPKVA5731J 50 82.651 12497 female
## 19 PLYDA2818S 46 95.716 13701 male
## 20 CVESQ8525O 61 56.157 8291 male
Given this data, I want to test the variables for several properties:
- Is alphanumeric, consisting of both digits and alphabet?
- Is numeric, consisting of only digits (decimal point is accepted)?
- Is decimal?
- Has letters only?
- Number of unique values in the column
And I have a method to test each one of those properties:
is alphanumeric
grepl("([A-Za-z]+[0-9]|[0-9]+[A-Za-z])[A-Za-z0-9]*", colname)
is numeric
grepl("^[0-9]{1,}$", colname)
is decimal
grepl("^[-]{0,1}[0-9]{0,}.{0,1}[0-9]{1,}$", colname)
has letters only
grepl("^[a-zA-Z]{1,}$", colname)
number of uniques
length(unique(colname))
Since I expect all values within each column to be consistent in property, I can wrap each of the above (except for number of uniques) with all()
to get a single TRUE
/FALSE
.
Now I want to organize all colnames that I want to test, in one vector:
vec_of_colnames <- c("special_id", "age", "weight", "sex")
And I want to pass vec_of_colnames
to a function that will return a summary table such as:
var_name is_alphanumeric is_numeric is_decimal has_letters_only number_of_uniques
<chr> <lgl> <lgl> <lgl> <lgl> <dbl>
1 special_id TRUE FALSE FALSE FALSE 20
2 age FALSE TRUE FALSE FALSE 20
3 weight FALSE TRUE TRUE FALSE 20
4 sex FALSE FALSE FALSE TRUE 2
How can I achieve such a table, in a way that will easily allow to add or remove "properties tests"?