
I'm attempting to build a function that will give me a summary of the properties of the variables in my data. For example, I want to know whether certain variables have digits in their values, or specific characters, or how many unique values they contain. While this is a simple task when testing a specific variable for a specific property, I don't know how to do it en masse. That is, I want to define different types of tests and pass a vector of column names on which to run each of the tests. I want to build a function that is scalable/flexible, so that I can add or remove tests as necessary.

Example

Let's say that I have some data about people, including age, weight, and sex. Each record is also assigned a special id string. Although the variables vary in type (e.g., age is numeric but sex is character), the entire data set ends up coerced to character.

library(tibble)
library(dplyr)

# a function to generate random alphanumeric strings (https://stackoverflow.com/a/42734863/6105259)
sample_strings <- function(n = 5000) {
  # five random uppercase letters per string
  a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
  # followed by four zero-padded digits and one more letter, e.g. "LBRKP2319H"
  paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}

set.seed(2020)

desired_n <- 20

df <-
  tibble(special_id = sample_strings(desired_n),
         age = sample(c(10:80), size = desired_n),
         weight = sample(seq(45, 100, by = 0.001), size = desired_n, replace = TRUE),
         monthly_income = sample(c(2000:15000), size = desired_n),
         sex = sample(c("male", "female"), size = desired_n, replace = TRUE)) %>%
  mutate_all(as.character)

## # A tibble: 20 x 5
##    special_id age   weight monthly_income sex   
##    <chr>      <chr> <chr>  <chr>          <chr> 
##  1 LBRKP2319H 20    87.554 4699           male  
##  2 WXJZW1202W 60    95.617 3163           female
##  3 VHFFD6488D 22    56.573 3468           female
##  4 XDOBT5680G 19    69.092 13418          female
##  5 ARKYR3604N 64    84.654 7865           female
##  6 QNMCJ9652Q 68    85.483 11704          male  
##  7 DNXXA5261W 56    69.472 6075           female
##  8 JLMRO7068U 12    47.802 12712          male  
##  9 FPMAV9425D 34    85.468 10120          female
## 10 QOBVI1118X 53    58.419 6984           male  
## 11 MNPGP4717P 40    52.546 7040           female
## 12 XGMHF7783I 32    51.613 8443           male  
## 13 HTAUX2428V 79    74.555 8339           male  
## 14 JRFXW0040B 54    69.734 2394           male  
## 15 RTCMZ4379Q 57    87.162 13986          male  
## 16 PUYBJ1885Q 13    45.77  8770           female
## 17 VTMEZ2421E 30    51.458 3576           male  
## 18 BPKVA5731J 50    82.651 12497          female
## 19 PLYDA2818S 46    95.716 13701          male  
## 20 CVESQ8525O 61    56.157 8291           male  

Given this data, I want to test the variables for several properties:

  • Is it alphanumeric, consisting of both digits and letters?
  • Is it numeric, consisting of digits only (a decimal point is accepted)?
  • Is it decimal?
  • Does it have letters only?
  • Number of unique values in the column

And I have a method to test each one of those properties:

  • is alphanumeric: grepl("([A-Za-z]+[0-9]|[0-9]+[A-Za-z])[A-Za-z0-9]*", colname)
  • is numeric: grepl("^[0-9]{1,}$", colname)
  • is decimal: grepl("^[-]{0,1}[0-9]{0,}.{0,1}[0-9]{1,}$", colname)
  • has letters only: grepl("^[a-zA-Z]{1,}$", colname)
  • number of uniques: length(unique(colname))

Since I expect all values within each column to be consistent in property, I can wrap each of the above (except for number of uniques) with all() to get a single TRUE/FALSE.
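
For a single column, this looks like the following (the results noted in the comments are what the example data above should give):

# wrapping one test in all() collapses it to a single TRUE/FALSE per column
all(grepl("^[0-9]{1,}$", df$age))      # is numeric: TRUE for the data above
all(grepl("^[a-zA-Z]{1,}$", df$sex))   # has letters only: TRUE for the data above
length(unique(df$special_id))          # number of uniques: 20 for the data above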

Now I want to organize all colnames that I want to test, in one vector:

vec_of_colnames <- c("special_id", "age", "weight", "sex")

And I want to pass vec_of_colnames to a function that will return a summary table such as:

  var_name   is_alphanumeric is_numeric is_decimal has_letters_only number_of_uniques
  <chr>      <lgl>           <lgl>      <lgl>      <lgl>                        <dbl>
1 special_id TRUE            FALSE      FALSE      FALSE                           20
2 age        FALSE           TRUE       FALSE      FALSE                           20
3 weight     FALSE           TRUE       TRUE       FALSE                           20
4 sex        FALSE           FALSE      FALSE      TRUE                             2

How can I achieve such a table, in a way that easily allows adding or removing "property tests"?


1 Answer


You could write the function as:

library(dplyr)
library(tidyr)


return_stats <- function(data, vec_of_colnames) {
  data %>%
    # apply every test to every requested column; the result columns are
    # named <column>_<test>, e.g. "age_isNumeric"
    summarise(across(.cols = all_of(vec_of_colnames), .fns = list(
      isAlphanumeric  = ~all(grepl("([A-Za-z]+[0-9]|[0-9]+[A-Za-z])[A-Za-z0-9]*", .)), 
      isNumeric  = ~all(grepl("^[0-9]{1,}$", .)),
      isDecimal = ~all(grepl("^[-]{0,1}[0-9]{0,}\\.{0,1}[0-9]{1,}$", .)),
      hasLettersOnly = ~all(grepl("^[a-zA-Z]{1,}$", .)),
      numberOfUniques = ~n_distinct(.)
    ))) %>%
    # reshape to one row per variable, splitting the names on the last underscore
    pivot_longer(cols = everything(), 
                 names_to = c('var_name', '.value'), 
                 names_pattern = '(.*)_(.*)')
}

vec_of_colnames <- c("special_id", "age", "weight", "sex")
return_stats(df, vec_of_colnames)

#  var_name   isAlphanumeric isNumeric isDecimal hasLettersOnly numberOfUniques
#  <chr>      <lgl>          <lgl>     <lgl>     <lgl>                    <int>
#1 special_id TRUE           FALSE     FALSE     FALSE                       20
#2 age        FALSE          TRUE      TRUE      FALSE                       20
#3 weight     FALSE          FALSE     TRUE      FALSE                       20
#4 sex        FALSE          FALSE     FALSE     TRUE                         2
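
As a possible extension (my own sketch, not part of the answer above), the list of tests could be passed in as an argument, so that adding or removing a check is a one-line change at the call site:

# hypothetical variation: take the list of tests as a parameter
return_stats2 <- function(data, vec_of_colnames, tests) {
  data %>%
    summarise(across(.cols = all_of(vec_of_colnames), .fns = tests)) %>%
    pivot_longer(cols = everything(),
                 names_to = c('var_name', '.value'),
                 names_pattern = '(.*)_(.*)')
}

my_tests <- list(
  isNumeric       = ~all(grepl("^[0-9]{1,}$", .)),
  hasLettersOnly  = ~all(grepl("^[a-zA-Z]{1,}$", .)),
  numberOfUniques = ~n_distinct(.)
)

return_stats2(df, vec_of_colnames, my_tests)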

Note that I have used the regex supplied by the OP as-is. It might need some further refining or testing for edge cases.
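
For instance, the current decimal pattern also matches integer-only columns such as age (hence isDecimal is TRUE for age in the output above). If that is not desired, one possible refinement, offered only as a sketch, is to require a literal decimal point:

isDecimal = ~all(grepl("^-?[0-9]*\\.[0-9]+$", .))   # "69.092" matches, "20" does not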

  • Perfect, thanks! Yes, regex is not 100% correct, but the functional framework is indeed what I was looking for. One question though, could there be a more foolproof way than `names_pattern = '(.*?)_(is.*|has.*|num.*)'`? I worry this might unintentionally fail if internal function naming is used inconsistently... – Emman Dec 15 '20 at 07:55
  • Yes, I agree. However, we cannot anticipate all the names the data could have. It would have been much more robust if the original column names had no underscores, or if new names were used. See the updated answer, which might help change the new column names so they contain no underscore. – Ronak Shah Dec 15 '20 at 08:02
  • I see, but can't we collect all names of internal functions (e.g., `isAlphanumeric`, `isNumeric`, `isDecimal`, etc.) and feed that into `pivot_longer` directly? – Emman Dec 15 '20 at 08:20
  • I think my current answer should be robust and should not fail, because it does not depend on the column names in your original dataframe. An `_` is added by default for each column in `vec_of_colnames` when we use `summarise` with `across`. So if your original column name is 'abc' or 'abc_var', it will create new columns 'abc_isAlphanumeric' and 'abc_var_isAlphanumeric', and in `pivot_longer` I separate them out on the last underscore. – Ronak Shah Dec 15 '20 at 08:41
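
A quick way to check this last-underscore behaviour, assuming the example data above, is to run the function on monthly_income, whose name itself contains an underscore:

return_stats(df, "monthly_income")
# summarise() creates columns such as "monthly_income_isNumeric";
# names_pattern = '(.*)_(.*)' splits on the last underscore (the first (.*) is greedy),
# giving var_name "monthly_income" and the value column "isNumeric"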