How to dynamically change data type of columns in data frame

Question

The platform from which I'm importing data to R does not support specifying the data type, hence all my columns are character. I have an Excel file that specifies which columns are factor, including the relevant labels and levels. Now, I'm trying to write a function to dynamically change the data type of various columns of my data.frame

Thanks to the excellent answer to this question (dplyr - mutate: use dynamic variable names), I managed to write the following function, in which I dynamically set the column name to the mutate function.

readFactorData <- function(filepath) {
    t <- read.xlsx(filepath)
    sapply(nrow(t), function(i) {
      colname <- as.character(t[i, "Item"])
      factorLevels <- t[i, 3:ncol(t)][which(!is.na(t[i, 3:ncol(t)]))]
      totalLevels <- length(factorLevels)
      listOfLabels <- as.character(unlist(factorLevels))

      mutate(d, !!colname := factor(d[[colname]], labels=(1:totalLevels), levels=listOfLabels))
        # requires dplyr v.0.7+
        # the syntax `!!variablename:=` forces evaluation of the variablename before evaluating the rest of the function
    })
}

It works, and each iteration returns the entire data frame, with the relevant column (colname) changed to factor. But, each iteration overwrites the previous, so this function only returns the last result of i. How do I make sure that I end up with 1 single data frame, in which all the relevant columns are saved?

Sample data (make sure to comment out the first line of the function above, since we're defining t here):

 d <- data.frame("id" = sample(100:999, 10), "age" = sample(18:80, 10), "factor1" = c(rep("a", 3), rep("b", 3), rep("c", 4)), "factor2" = c("x","y","y","y","y","x","x","x","x","y"), stringsAsFactors = FALSE)
 t <- data.frame("Item" = c("factor1","factor2"), "Label" = c("This is factor 1", "This is factor 2"), "level1" = c("a","x"), "level2" = c("b","y"), "level3" = c("c","NA"))

score 1 · Accepted Answer · answered Jan 11 '19 at 12:35

1

If I understand correctly you have one data frame with factor column values of another data frame. You want to extract these from the 1st df and mutate these columns in the 2nd df and turn them into factors.

What about keeping a vector of the column names and then mutate them all?

colnames <- t %>%
  pull(Item) %>%
  as.character()

d_with_factors <- d %>%
  mutate_at(colnames, as.factor)

Then

sapply(d_with_factors, class)

Returns

       id       age   factor1   factor2 
"integer" "integer"  "factor"  "factor"

answered Jan 11 '19 at 12:35

Eden Z

46
2

That would work if I don’t need to manually specify the levels and labels of each factor as well. How would you do that? Since each factor has different levels and labels I need to loop over them, but then when using `sapply()` it doesn’t take the updated data frame, but uses the original one each time, ending up with the results only of the last loop – rvrvrv Jan 11 '19 at 16:58
I'm still not sure why you need to manually add labels and levels. At least in your example with t and d you have the same factor values in both data frames. I also see that you send nrow(t) to your sapply function, which would eventually only send the number 2 to the function (as i). You can use sapply(1:nrow(t), ...) so it'll run on all of your data frame (and not only the last row). – Eden Z Jan 11 '19 at 21:11

score 1 · Answer 2 · answered Jan 11 '19 at 15:19

If you want to convert all your factor to character within your data.frame you can use dplyr's mutate_if. Otherwise, if you want to use a vector of column names, @Eden Z's answer will do it for you.

library(tidyverse)

d_out <- d %>% 
  mutate_if(is.character, as.factor)

d_out

#    id age factor1 factor2
#1  933  61       a       x
#2  208  52       a       y
#3  193  25       a       y
#4  231  47       b       y
#5  595  78       b       y
#6  675  28       b       x
#7  387  71       c       x
#8  386  80       c       x
#9  893  20       c       x
#10 272  23       c       y

When you can check the class for your variables doing:

sapply(d_out, class)

#       id       age   factor1   factor2 
#"integer" "integer"  "factor"  "factor"

Bryan Shalloway · Answer 3 · 2019-01-15T14:15:28.883

The function below maps through readr::parse_* functions specified for each named column you want to change and then allows you to specify args for each (e.g. levels if using parse_factor).

library(tidyverse)

parse_cols <- function(df, f, col_names, levels, ...){
  # df: dataframe, f: char vec, col_names: char vec, levels: list of char vecs,
  # ...: list of other potential args for parse_*
  params_t <- tibble(x = map(col_names, ~df[[.x]]), levels = levels, ...) %>% transpose()

  new_cols <- map2_df(.x = structure(f, names = col_names), 
                      .y = params_t, 
                      ~R.utils::doCall(.x, args = .y, .ignoreUnusedArgs = TRUE))  

  df[names(new_cols)] <- new_cols
  df
}

# function inputs -- perhaps just requiring a tibble input would be safer
parsings_vec <- c("parse_factor","parse_double", "parse_factor")
cols_vec <- c("manufacturer", "cty", "class")
factors_list <- list(unique(mpg[["manufacturer"]]), NULL, unique(mpg[["class"]]))

parse_cols(df = mpg, f = parsings_vec, col_names = cols_vec, levels = factors_list) 
#> # A tibble: 234 x 11
#>    manufacturer model displ  year   cyl trans drv     cty   hwy fl    cla~
#>    <fct>        <chr> <dbl> <int> <int> <chr> <chr> <dbl> <int> <chr> <fc>
#>  1 audi         a4      1.8  1999     4 auto~ f        18    29 p     com~
#>  2 audi         a4      1.8  1999     4 manu~ f        21    29 p     com~
#>  3 audi         a4      2    2008     4 manu~ f        20    31 p     com~
#>  4 audi         a4      2    2008     4 auto~ f        21    30 p     com~
#>  5 audi         a4      2.8  1999     6 auto~ f        16    26 p     com~
#>  6 audi         a4      2.8  1999     6 manu~ f        18    26 p     com~
#>  7 audi         a4      3.1  2008     6 auto~ f        18    27 p     com~
#>  8 audi         a4 q~   1.8  1999     4 manu~ 4        18    26 p     com~
#>  9 audi         a4 q~   1.8  1999     4 auto~ 4        16    25 p     com~
#> 10 audi         a4 q~   2    2008     4 manu~ 4        20    28 p     com~
#> # ... with 224 more rows

How to dynamically change data type of columns in data frame

3 Answers3