95

Given a (pre-existing) data frame that has columns of various types, what is the simplest way to convert all its character columns to factors, without affecting any columns of other types?

Here's an example data.frame:

df <- data.frame(A = factor(LETTERS[1:5]),
                 B = 1:5, C = as.logical(c(1, 1, 0, 0, 1)),
                 D = letters[1:5],
                 E = paste(LETTERS[1:5], letters[1:5]),
                 stringsAsFactors = FALSE)
df
#   A B     C D   E
# 1 A 1  TRUE a A a
# 2 B 2  TRUE b B b
# 3 C 3 FALSE c C c
# 4 D 4 FALSE d D d
# 5 E 5  TRUE e E e
str(df)
# 'data.frame':  5 obs. of  5 variables:
#  $ A: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
#  $ B: int  1 2 3 4 5
#  $ C: logi  TRUE TRUE FALSE FALSE TRUE
#  $ D: chr  "a" "b" "c" "d" ...
#  $ E: chr  "A a" "B b" "C c" "D d" ...

I know I can do:

df$D <- as.factor(df$D)
df$E <- as.factor(df$E)

Is there a way to automate this process a bit more?

A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
Museful
  • 6,711
  • 5
  • 42
  • 68

8 Answers8

123

Roland's answer is great for this specific problem, but I thought I would share a more generalized approach.

DF <- data.frame(x = letters[1:5], y = 1:5, z = LETTERS[1:5], 
                 stringsAsFactors=FALSE)
str(DF)
# 'data.frame':  5 obs. of  3 variables:
#  $ x: chr  "a" "b" "c" "d" ...
#  $ y: int  1 2 3 4 5
#  $ z: chr  "A" "B" "C" "D" ...

## The conversion
DF[sapply(DF, is.character)] <- lapply(DF[sapply(DF, is.character)], 
                                       as.factor)
str(DF)
# 'data.frame':  5 obs. of  3 variables:
#  $ x: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
#  $ y: int  1 2 3 4 5
#  $ z: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5

For the conversion, the left hand side of the assign (DF[sapply(DF, is.character)]) subsets the columns that are character. In the right hand side, for that subset, you use lapply to perform whatever conversion you need to do. R is smart enough to replace the original columns with the results.

The handy thing about this is if you wanted to go the other way or do other conversions, it's as simple as changing what you're looking for on the left and specifying what you want to change it to on the right.

A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
  • Thanks, very useful, especially after a RMySQL request that gives a dataframe of only character vectors. Just don't forget (like me) to set the proper type of numeric logical, etc. in the columns that are not character beforehand. – Joel.O Jul 12 '16 at 21:09
  • I like that answer. Partly because it's a more thorough solution, and partly because is the most involved use of lapply an sapply I've seen. I'll learn a bit more from that one! – Chris Nov 13 '19 at 18:45
  • You are a legend man! Thank you, I was searching for this for the last 1-2 days. – Vivek Sep 26 '21 at 23:51
76
DF <- data.frame(x=letters[1:5], y=1:5, stringsAsFactors=FALSE)

str(DF)
#'data.frame':  5 obs. of  2 variables:
# $ x: chr  "a" "b" "c" "d" ...
# $ y: int  1 2 3 4 5

You can use as.data.frame to turn all character columns into factor columns:

DF <- as.data.frame(unclass(DF),stringsAsFactors=TRUE)
str(DF)
#'data.frame':  5 obs. of  2 variables:
# $ x: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
# $ y: int  1 2 3 4 5
Roland
  • 127,288
  • 10
  • 191
  • 288
  • 5
    As of a recent version of R, this is no longer necessarily true. The best option appears to be to set the `stringsAsFactors` argument to `TRUE` in the call to `as.data.frame()`. https://developer.r-project.org/Blog/public/2020/02/16/stringsasfactors/ – Paul de Barros Oct 22 '20 at 00:10
  • Why is the default from as.dato.frame annoying? Seems easy and handy to use – René Martínez Feb 07 '22 at 20:51
  • @RenéMartínez The default has been changed. More often than not, you don't want to transform character columns to factors. – Roland Feb 08 '22 at 06:45
62

As @Raf Z commented on this question, dplyr now has mutate_if. Super useful, simple and readable.

> str(df)
'data.frame':   5 obs. of  5 variables:
 $ A: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
 $ B: int  1 2 3 4 5
 $ C: logi  TRUE TRUE FALSE FALSE TRUE
 $ D: chr  "a" "b" "c" "d" ...
 $ E: chr  "A a" "B b" "C c" "D d" ...

> df <- df %>% mutate_if(is.character,as.factor)

> str(df)
'data.frame':   5 obs. of  5 variables:
 $ A: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
 $ B: int  1 2 3 4 5
 $ C: logi  TRUE TRUE FALSE FALSE TRUE
 $ D: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
 $ E: Factor w/ 5 levels "A a","B b","C c",..: 1 2 3 4 5
Community
  • 1
  • 1
Tedward
  • 1,652
  • 15
  • 19
14

Working with dplyr

library(dplyr)

df <- data.frame(A = factor(LETTERS[1:5]),
                 B = 1:5, C = as.logical(c(1, 1, 0, 0, 1)),
                 D = letters[1:5],
                 E = paste(LETTERS[1:5], letters[1:5]),
                 stringsAsFactors = FALSE)

str(df)

we get:

'data.frame':   5 obs. of  5 variables:
 $ A: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
 $ B: int  1 2 3 4 5
 $ C: logi  TRUE TRUE FALSE FALSE TRUE
 $ D: chr  "a" "b" "c" "d" ...
 $ E: chr  "A a" "B b" "C c" "D d" ...

Now, we can convert all chr to factors:

df <- df%>%mutate_if(is.character, as.factor)
str(df)

And we get:

'data.frame':   5 obs. of  5 variables:
 $ A: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
 $ B: int  1 2 3 4 5
 $ C: logi  TRUE TRUE FALSE FALSE TRUE
 $ D: chr  "a" "b" "c" "d" ...
 $ E: chr  "A a" "B b" "C c" "D d" ...

Let's provide also other solutions:

With base package:

df[sapply(df, is.character)] <- lapply(df[sapply(df, is.character)], 
                                                           as.factor)

With dplyr 1.0.0

df <- df%>%mutate(across(where(is.factor), as.character))

With purrr package:

library(purrr)

df <- df%>% modify_if(is.factor, as.character) 
George Pipis
  • 1,452
  • 16
  • 12
  • I am using the line that you used with dplyr 1.0.0 which works (thanks!), but then when I try to change the columns to mutate (e.g. 1:3) or the function (e.g. as.numeric) it fails! What am I not getting, should't it be easy to change the parameters a bit? – Gilrob Aug 10 '23 at 06:28
4

The easiest way would be to use the code given below. It would automate the whole process of converting all the variables as factors in a dataframe in R. it worked perfectly fine for me. food_cat here is the dataset which I am using. Change it to the one which you are working on.

    for(i in 1:ncol(food_cat)){

food_cat[,i] <- as.factor(food_cat[,i])

}
SHREYANSHU
  • 71
  • 1
  • 10
1

I used to do a simple for loop. As @A5C1D2H2I1M1N2O1R2T1 answer, lapply is a nice solution. But if you convert all the columns, you will need a data.frame before, otherwise you will end up with a list. Little execution time differences.

 mm2N=mm2New[,10:18]
 str(mm2N)
'data.frame':   35487 obs. of  9 variables:
 $ bb    : int  4 6 2 3 3 2 5 2 1 2 ...
 $ vabb  : int  -3 -3 -2 -2 -3 -1 0 0 3 3 ...
 $ bb55  : int  7 6 3 4 4 4 9 2 5 4 ...
 $ vabb55: int  -3 -1 0 -1 -2 -2 -3 0 -1 3 ...
 $ zr    : num  0 -2 -1 1 -1 -1 -1 1 1 0 ...
 $ z55r  : num  -2 -2 0 1 -2 -2 -2 1 -1 1 ...
 $ fechar: num  0 -1 1 0 1 1 0 0 1 0 ...
 $ varr  : num  3 3 1 1 1 1 4 1 1 3 ...
 $ minmax: int  3 0 4 6 6 6 0 6 6 1 ...

 # For solution
 t1=Sys.time()
 for(i in 1:ncol(mm2N)) mm2N[,i]=as.factor(mm2N[,i])
 Sys.time()-t1
Time difference of 0.2020121 secs
 str(mm2N)
'data.frame':   35487 obs. of  9 variables:
 $ bb    : Factor w/ 6 levels "1","2","3","4",..: 4 6 2 3 3 2 5 2 1 2 ...
 $ vabb  : Factor w/ 7 levels "-3","-2","-1",..: 1 1 2 2 1 3 4 4 7 7 ...
 $ bb55  : Factor w/ 8 levels "2","3","4","5",..: 6 5 2 3 3 3 8 1 4 3 ...
 $ vabb55: Factor w/ 7 levels "-3","-2","-1",..: 1 3 4 3 2 2 1 4 3 7 ...
 $ zr    : Factor w/ 5 levels "-2","-1","0",..: 3 1 2 4 2 2 2 4 4 3 ...
 $ z55r  : Factor w/ 5 levels "-2","-1","0",..: 1 1 3 4 1 1 1 4 2 4 ...
 $ fechar: Factor w/ 3 levels "-1","0","1": 2 1 3 2 3 3 2 2 3 2 ...
 $ varr  : Factor w/ 5 levels "1","2","3","4",..: 3 3 1 1 1 1 4 1 1 3 ...
 $ minmax: Factor w/ 7 levels "0","1","2","3",..: 4 1 5 7 7 7 1 7 7 2 ...

 #lapply solution
 mm2N=mm2New[,10:18]
 t1=Sys.time()
 mm2N <- lapply(mm2N, as.factor)
 Sys.time()-t1
Time difference of 0.209012 secs
 str(mm2N)
List of 9
 $ bb    : Factor w/ 6 levels "1","2","3","4",..: 4 6 2 3 3 2 5 2 1 2 ...
 $ vabb  : Factor w/ 7 levels "-3","-2","-1",..: 1 1 2 2 1 3 4 4 7 7 ...
 $ bb55  : Factor w/ 8 levels "2","3","4","5",..: 6 5 2 3 3 3 8 1 4 3 ...
 $ vabb55: Factor w/ 7 levels "-3","-2","-1",..: 1 3 4 3 2 2 1 4 3 7 ...
 $ zr    : Factor w/ 5 levels "-2","-1","0",..: 3 1 2 4 2 2 2 4 4 3 ...
 $ z55r  : Factor w/ 5 levels "-2","-1","0",..: 1 1 3 4 1 1 1 4 2 4 ...
 $ fechar: Factor w/ 3 levels "-1","0","1": 2 1 3 2 3 3 2 2 3 2 ...
 $ varr  : Factor w/ 5 levels "1","2","3","4",..: 3 3 1 1 1 1 4 1 1 3 ...
 $ minmax: Factor w/ 7 levels "0","1","2","3",..: 4 1 5 7 7 7 1 7 7 2 ...

 #data.frame lapply solution
 mm2N=mm2New[,10:18]
 t1=Sys.time()
 mm2N <- data.frame(lapply(mm2N, as.factor))
 Sys.time()-t1
Time difference of 0.2010119 secs
 str(mm2N)
'data.frame':   35487 obs. of  9 variables:
 $ bb    : Factor w/ 6 levels "1","2","3","4",..: 4 6 2 3 3 2 5 2 1 2 ...
 $ vabb  : Factor w/ 7 levels "-3","-2","-1",..: 1 1 2 2 1 3 4 4 7 7 ...
 $ bb55  : Factor w/ 8 levels "2","3","4","5",..: 6 5 2 3 3 3 8 1 4 3 ...
 $ vabb55: Factor w/ 7 levels "-3","-2","-1",..: 1 3 4 3 2 2 1 4 3 7 ...
 $ zr    : Factor w/ 5 levels "-2","-1","0",..: 3 1 2 4 2 2 2 4 4 3 ...
 $ z55r  : Factor w/ 5 levels "-2","-1","0",..: 1 1 3 4 1 1 1 4 2 4 ...
 $ fechar: Factor w/ 3 levels "-1","0","1": 2 1 3 2 3 3 2 2 3 2 ...
 $ varr  : Factor w/ 5 levels "1","2","3","4",..: 3 3 1 1 1 1 4 1 1 3 ...
 $ minmax: Factor w/ 7 levels "0","1","2","3",..: 4 1 5 7 7 7 1 7 7 2 ...
xm1
  • 1,663
  • 1
  • 17
  • 28
  • 1
    This will also change all columns. OP asked to change only character columns to factor. – SeGa Oct 25 '19 at 10:04
  • @SeGa one can put an `if(is.character())` to convert only these columns. – xm1 Nov 11 '19 at 12:55
  • 1
    Just a heads up, you don't need to do `data.frame` with the `lapply` approach when converting all columns. Instead of `mm2N <- data.frame(lapply(mm2N, as.factor))` you can do `mm2N[] <- lapply(mm2N, as.factor)`. Note the `[]`. – A5C1D2H2I1M1N2O1R2T1 Jul 07 '20 at 01:57
  • @A5C1D2H2I1M1N2O1R2T1, what does it mean assigning to a `[]`? Isn´t there an implicit conversion to data.frame? Tks! – xm1 Jul 08 '20 at 16:16
  • 1
    @xm1, since a `data.frame` is just a special kind of `list`, you can just re-insert the relevant elements at the specific indices (which is why I had used `[sapply(DF, is.character)]` in my answer). If you use `[]` it replaces everything (provided that the dimensions are correct). Thus, you could do `df <- data.frame(v1 = 1, v2 = 2); df[] <- list("a", "b")` but you can't do `df[] <- list("a", "b", "c")`. – A5C1D2H2I1M1N2O1R2T1 Jul 08 '20 at 17:34
  • 1
    This works for other objects too and is useful for retaining the structure of the original object. Eg, if you had `m <- matrix(1:2, ncol = 2)` and you wanted to match values to letters, if you did `m <- letters[m]`, dimensions are lost. But if you did `m[] <- letters[m]`, they're retained. – A5C1D2H2I1M1N2O1R2T1 Jul 08 '20 at 17:34
0

I noticed "[" indexing columns fails to create levels when iterating:

for ( a_feature in convert.to.factors) {
feature.df[a_feature] <- factor(feature.df[a_feature]) }

It creates, e.g. for the "Status" column:

Status : Factor w/ 1 level "c(\"Success\", \"Fail\")" : NA NA NA ...

Which is remedied by using "[[" indexing:

for ( a_feature in convert.to.factors) {
feature.df[[a_feature]] <- factor(feature.df[[a_feature]]) }

Giving instead, as desired:

. Status : Factor w/ 2 levels "Success", "Fail" : 1 1 2 1 ...

John Mark
  • 311
  • 1
  • 8
0

Based on @Roland 's answer and @Paul de Barros 's comments, I observed to the following conclusion:

    df <- data.frame(A = factor(LETTERS[1:5]),
                 B = 1:5, C = as.logical(c(1, 1, 0, 0, 1)),
                 D = letters[1:5],
                 E = paste(LETTERS[1:5], letters[1:5]),
                 stringsAsFactors = FALSE)
   
   df<-as.data.frame(unclass(df),stringsAsFactors=TRUE)
   str(df)

Practically and simply seems to work.

> str(df)
'data.frame':   5 obs. of  5 variables:
 $ A: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
 $ B: int  1 2 3 4 5
 $ C: logi  TRUE TRUE FALSE FALSE TRUE
 $ D: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
 $ E: Factor w/ 5 levels "A a","B b","C c",..: 1 2 3 4 5
NCC1701
  • 139
  • 11