How to write a custom function to semi-automate column naming when converting a data object to dataframe

Question

I'm trying to write a function that will take a list object (in a specific format) and return a dataframe. When doing so, I have two criteria that are somewhat conflicting:

The returned dataframe should have column names that indicate what each column is about (that is, not col_a, col_b, etc.).
The function should accept a list object that could hold various kinds of data (e.g., weight/height/age/mood/country, etc.)

These criteria conflict because the function can't know how to name the columns without prior information. My solution is to semi-automate the function by adding an argument that states its purpose when the function is called.

Therefore, because the function knows its "purpose" when being executed, it knows that one column is going to include (for example) age, another column will include country, etc.

However, how could I be sure that the right columns are being named with the appropriate names, and there is no mismatching (e.g., "age" header is assigned to the weight column)?

I'm trying to work this problem out with tidyverse functions.

1 -- Data object to convert

vec <- c(1, 2, 3)
names(vec) <- c("A", "B", "C")
my_data_object_as_list <- as.list(vec)

my_data_object_as_list
## $A
## [1] 1

## $B
## [1] 2

## $C
## [1] 3

2 -- My custom function for converting

library(tidyr)
library(dplyr)
library(tidyselect)

organize_in_table <-
  function(as_list_object,
           purpose = NULL) {
    table <- as_list_object %>%
      bind_rows() %>%
      pivot_longer(cols = tidyselect::everything())
    
    if (is.null(purpose)) {
      return(table)
    } else if (purpose == "match_letters_and_numbers") {
      table <- rename(table, letters = name, numbers = value)
    }
    return(table)
  }

EDIT
From @akrun's comment I've learned that I could equivalently use:

library(tibble)

organize_in_table <-
  function(as_list_object,
           purpose = NULL) {
    table <- as_list_object %>%
      enframe() %>%
      tidyr::unnest(c(value))
    
    if (is.null(purpose)) {
      return(table)
    } else if (purpose == "match_letters_and_numbers") {
      table <- rename(table, letters = name, numbers = value)
    }
    return(table)
}

3 -- Example for using the function

df_letters_and_numbers <- 
  organize_in_table(my_data_object_as_list, "match_letters_and_numbers")

> df_letters_and_numbers
## # A tibble: 3 x 2
##   letters numbers
##   <chr>     <dbl>
## 1 A             1
## 2 B             2
## 3 C             3

4 -- Demonstration of potential problem

Data to be converted

vec_2 <- c("A", "B", "C")
names(vec_2) <- c(1, 2, 3)
my_data_object_as_list_2 <- as.list(vec_2)

> my_data_object_as_list_2 
## $`1`
## [1] "A"

## $`2`
## [1] "B"

## $`3`
## [1] "C"

Conversion ends up with mismatching column names

organize_in_table(my_data_object_as_list_2, "match_letters_and_numbers")

## # A tibble: 3 x 2
##   letters numbers
##   <chr>   <chr>  
## 1 1       A      
## 2 2       B      
## 3 3       C

The key point to keep in mind is that this function should be able to accept any kind of data (e.g., age, weight, distance, dominant personality trait, name, driver license ID, etc.). The user calling the function is responsible for specifying what kind of variables are included.

Below are two examples of data types that need validation before renaming. Given the purpose argument, organize_in_table() should know which "validating functions" to apply before returning the dataframe with named columns.
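To make the idea of a purpose-driven lookup concrete, here is a minimal sketch for the letters/numbers case from section 4. The names `validators_by_purpose` and `guess_label`, and the two regular expressions, are assumptions made for illustration; they are not part of the original function.

```r
# Hypothetical lookup: purpose -> named list of validating functions.
# The list names double as the column labels to assign.
validators_by_purpose <- list(
  match_letters_and_numbers = list(
    letters = function(x) grepl("^[A-Za-z]+$", x),
    numbers = function(x) grepl("^[0-9]+$", x)
  )
)

# Return the single label whose validator accepts every element of `x`;
# stop if no label, or more than one label, fits.
guess_label <- function(x, purpose) {
  validators <- validators_by_purpose[[purpose]]
  ok <- vapply(validators, function(f) all(f(x)), logical(1))
  if (sum(ok) != 1) stop("cannot label this column unambiguously")
  names(validators)[ok]
}

guess_label(c("A", "B", "C"), "match_letters_and_numbers")
guess_label(c(1, 2, 3), "match_letters_and_numbers")
```

With a lookup like this, the label depends on what the values look like rather than on whether they arrived as list names or list elements.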

Example #1 -- Matching Greek words and equivalent words in English
Data

vec_greek <- c("σκύλος", "Γάτα", "ζέβρα")
names(vec_greek) <- c("dog", "cat", "zebra")
data_object_greek_english <- as.list(vec_greek)

data_object_greek_english
## $dog
## [1] "σκύλος"

## $cat
## [1] "Γάτα"

## $zebra
## [1] "ζέβρα"

Validating functions

Is Greek?

grepl("[\u0370-\u03ff\u1f00-\u1fff]+", x)

Is English?

library(stringi)
stri_enc_isascii(x)
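As a rough sketch, the two checks above could be wrapped as vectorised helpers. The names `is_greek` and `is_english` are assumptions made for illustration, and the Greek word is written with `\u` escapes only to keep the example independent of the file encoding.

```r
library(stringi)

# Hypothetical wrappers (illustrative names) around the two checks above
is_greek   <- function(x) grepl("[\u0370-\u03ff\u1f00-\u1fff]+", x)
is_english <- function(x) stri_enc_isascii(x)

# "σκύλος" written with escapes, followed by "dog"
words <- c("\u03c3\u03ba\u03cd\u03bb\u03bf\u03c2", "dog")

is_greek(words)
is_english(words)
```

Note that the Greek pattern is unanchored, so a string containing even one Greek character counts as Greek, and the ASCII check also accepts digits and punctuation. Both are loose proxies rather than strict language detectors.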

Desired Output

## regardless of whether data object is "data_object_greek_english_1": 
vec_greek <- c("σκύλος", "Γάτα", "ζέβρα")
names(vec_greek) <- c("dog", "cat", "zebra")
data_object_greek_english_1 <- as.list(vec_greek)
## or "data_object_greek_english_2":
vec_english <- c("dog", "cat", "zebra")
names(vec_english) <- c("σκύλος", "Γάτα", "ζέβρα")
data_object_greek_english_2 <- as.list(vec_english)

## the call:
organize_in_table(data_object_greek_english_1, purpose = "match_greek_and_english")
## should return the same output as:
organize_in_table(data_object_greek_english_2, purpose = "match_greek_and_english")

## # A tibble: 3 x 2
##   english greek   ## position of columns doesn't matter as long as headers are appropriate to values
##   <chr>   <chr> 
## 1 dog     σκύλος
## 2 cat     Γάτα  
## 3 zebra   ζέβρα

Example #2 -- Matching phone numbers and California driver license ID
Data
(The data below is entirely made up.)

vec_driver_license <- c("F2849563", "I2938461", "B2293890")
names(vec_driver_license) <- c("626-710-9060", "831-263-9154", "510-923-6869")
data_object_phone_dl <- as.list(vec_driver_license)

data_object_phone_dl
## $`626-710-9060`
## [1] "F2849563"

## $`831-263-9154`
## [1] "I2938461"

## $`510-923-6869`
## [1] "B2293890"

Validating functions

Is phone number?

grepl("^\\s*(\\+\\s*1(-?|\\s+))*[0-9]{3}\\s*-?\\s*[0-9]{3}\\s*-?\\s*[0-9]{4}$", x)

Is driver license ID?

grepl("^[A-Z]{1}\\d{7}$", x)
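For the approach to work, the two patterns must be mutually exclusive on the data at hand. A quick sketch of that sanity check, with the hypothetical helper names `is_phone_number` and `is_driver_license_id`:

```r
# Hypothetical wrappers (illustrative names) around the two patterns above
is_phone_number <- function(x) {
  grepl("^\\s*(\\+\\s*1(-?|\\s+))*[0-9]{3}\\s*-?\\s*[0-9]{3}\\s*-?\\s*[0-9]{4}$", x)
}
is_driver_license_id <- function(x) grepl("^[A-Z]{1}\\d{7}$", x)

phones <- c("626-710-9060", "831-263-9154", "510-923-6869")
ids    <- c("F2849563", "I2938461", "B2293890")

# Each validator should accept its own kind and reject the other
phone_check <- all(is_phone_number(phones)) && !any(is_phone_number(ids))
id_check    <- all(is_driver_license_id(ids)) && !any(is_driver_license_id(phones))
c(phone_check, id_check)
```

If two validators could both accept the same column, the label would depend on the order in which they are tried, so it is worth testing them against each other's sample data.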

Desired Output

## regardless of whether data object is "data_object_phone_dl_1": 
vec_driver_license <- c("F2849563", "I2938461", "B2293890")
names(vec_driver_license) <- c("626-710-9060", "831-263-9154", "510-923-6869")
data_object_phone_dl_1 <- as.list(vec_driver_license)
## or "data_object_phone_dl_2":
vec_phone_number <- c("626-710-9060", "831-263-9154", "510-923-6869")
names(vec_phone_number) <- c("F2849563", "I2938461", "B2293890")
data_object_phone_dl_2 <- as.list(vec_phone_number)

## the call:
organize_in_table(data_object_phone_dl_1, purpose = "match_phone_and_dl")
## should return the same output as:
organize_in_table(data_object_phone_dl_2, purpose = "match_phone_and_dl")

## # A tibble: 3 x 2
##   phone_number driver_license_id  ## position of columns doesn't matter as long as headers are appropriate to values
##   <chr>        <chr>            
## 1 626-710-9060 F2849563         
## 2 831-263-9154 I2938461         
## 3 510-923-6869 B2293890

My take on it is that you should write a function that will be tasked with identifying the type of data in the list in order to determine its nature. The function would check the criteria that you mentioned (e.g. positive integer that can't be greater than 110...) and then assign a column name accordingly. If you mention examples of the different types of values that you expect, the community here can help you write appropriate functions. — SavedByJESUS, Dec 25 '20 at 15:45
Wouldn't this be easier with the already available functions i.e. `stack(my_data_object_as_list)` from `base R` or `enframe(my_data_object_as_list) %>% unnest(c(value))` With `enframe`, you can also specify the names of the columns — akrun, Dec 25 '20 at 17:47
@SavedByJESUS -- I've just edited the post to provide two use case examples with relevant criteria-checking functions. Thanks! — Emman, Dec 26 '20 at 13:23
@akrun, your suggestions are neat, but don't address the renaming issue. I've just edited the post to add more concrete demonstrations of what I'm trying to achieve. Thanks — Emman, Dec 26 '20 at 13:24
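For reference, the two alternatives mentioned in akrun's comment can be sketched as follows, using the list from section 1. As noted in the reply, they name the columns by position and do not check whether the names fit the values.

```r
library(tibble)
library(tidyr)

my_data_object_as_list <- list(A = 1, B = 2, C = 3)

# Base R: stack() returns a data frame with columns `values` and `ind`
stacked <- stack(my_data_object_as_list)
stacked

# tibble: enframe() accepts the two column names directly;
# unnest() then flattens the list column into a regular column
framed <- unnest(
  enframe(my_data_object_as_list, name = "letters", value = "numbers"),
  cols = numbers
)
framed
```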

SavedByJESUS · Accepted Answer · 2020-12-27T14:25:33.270

I split the final solution into two distinct functions; however, you may want to nest them if you so wish. Also, I got rid of the purpose argument. You can definitely put it back if it absolutely serves your purpose:

# Load packages
library(dplyr)

# Make data
vec_driver_license <- c("F2849563", "I2938461", "B2293890")
names(vec_driver_license) <- c("626-710-9060", "831-263-9154", "510-923-6869")
data_object_phone_dl_1 <- as.list(vec_driver_license)

vec_phone_number <- c("626-710-9060", "831-263-9154", "510-923-6869")
names(vec_phone_number) <- c("F2849563", "I2938461", "B2293890")
data_object_phone_dl_2 <- as.list(vec_phone_number)

# Create custom functions

check_content <- function(x){
  
  # Label x by the first pattern that every element matches
  if(all(grepl("[\u0370-\u03ff\u1f00-\u1fff]+", x))){
    out <- "greek"
  } else if(all(grepl("^[A-Z]{1}\\d{7}$", x))){
    out <- "driver_license"
  } else if(all(grepl("^\\s*(\\+\\s*1(-?|\\s+))*[0-9]{3}\\s*-?\\s*[0-9]{3}\\s*-?\\s*[0-9]{4}$", x))){
    out <- "phone_number"
  } else {
    # No pattern matched every element
    out <- "undefined"
  }
  
  out
  
}

organize_in_table <- function(data_list){
  
  df <- tibble::enframe(data_list) %>%
    tidyr::unnest(cols = value)
  
  colnames(df) <- purrr::map_chr(df, check_content)
  
  df
}

# Demo with data_object_phone_dl_1
organize_in_table(data_object_phone_dl_1)

# A tibble: 3 x 2
  phone_number driver_license
  <chr>        <chr>         
1 626-710-9060 F2849563      
2 831-263-9154 I2938461      
3 510-923-6869 B2293890 

# Demo with data_object_phone_dl_2
organize_in_table(data_object_phone_dl_2)

# A tibble: 3 x 2
  driver_license phone_number
  <chr>          <chr>       
1 F2849563       626-710-9060
2 I2938461       831-263-9154
3 B2293890       510-923-6869

edited Dec 27 '20 at 14:25

answered Dec 26 '20 at 18:00

SavedByJESUS


Thanks for these functions! However, when I tried running `organize_in_table(data_object_phone_dl_1)` it returned (via `dput()`) `structure(list(driver_license = "F2849563", driver_license = "I2938461", driver_license = "B2293890"), row.names = c(NA, -1L), class = c("tbl_df", "tbl", "data.frame"))`. I can't locate where to correct this in the function though... do you have an idea? – Emman Dec 26 '20 at 19:57
This is because the functions, as I wrote them, expect character strings of the same type to be received in a vector. This is what I tried to exemplify with my `data_list`. So you should try: `organize_in_table(vec_driver_license)` instead (the vector counterpart of `data_object_phone_dl_1`). – SavedByJESUS Dec 26 '20 at 23:47
I see. But my situation is particular, and `vec_driver_license` doesn't reflect it. My data arrives in the format of `data_object_phone_dl_1`, meaning that both the underlying vector ***and*** the superimposed names are in fact part of the data. Therefore, it is the *pairing* between vector elements and names that comprises the data. This is why `vec_driver_license` doesn't stand on its own, and without the accompanying names it's meaningless. I specified `vec_driver_license` only to make this post reproducible. In reality, objects such as `data_object_phone_dl_1` come from a `json` file. – Emman Dec 27 '20 at 07:08
@Emman Okay, I understand better now. I just edited the code to suit your needs. – SavedByJESUS Dec 27 '20 at 07:35
Awesome, this looks great, thank you! As far as I can see, the line `data_list <- data_object_phone_dl_1` became unnecessary, right? – Emman Dec 27 '20 at 07:44
You're right. I just used it to test the code inside the function and forgot to remove it before posting the answer. – SavedByJESUS Dec 27 '20 at 14:25