189

I would like to change the class of some columns of my data.frame object (mydf) from character to factor.

I don't want to do this while reading the text file with the read.table() function.

Any help would be appreciated.

zx8754
Rasoul

8 Answers

240

Hi, welcome to the world of R.

mtcars  #look at this built in data set
str(mtcars) #allows you to see the classes of the variables (all numeric)

#one approach is to index with the $ sign and the as.factor function
mtcars$am <- as.factor(mtcars$am)
#another approach
mtcars[, 'cyl'] <- as.factor(mtcars[, 'cyl'])
str(mtcars)  # now look at the classes

This also works for character, date, integer and other classes, using the matching conversion function (as.character(), as.Date(), as.integer()).
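The same assignment pattern with other target classes might look like the sketch below. The data frame d and its column names are made up for illustration. It also shows one known pitfall: calling as.numeric() directly on a factor returns the internal level codes, not the original values.

```r
# Illustrative data frame; the names d, id, day and grp are assumptions.
d <- data.frame(id  = c("1", "2", "3"),
                day = c("2012-02-12", "2012-02-13", "2012-02-14"),
                grp = c("10", "20", "10"),
                stringsAsFactors = FALSE)

d$id  <- as.integer(d$id)   # character -> integer
d$day <- as.Date(d$day)     # character -> Date (yyyy-mm-dd is parsed by default)
d$grp <- as.factor(d$grp)   # character -> factor

# Pitfall: as.numeric() on a factor gives the level codes,
# so convert to character first to recover the original numbers.
codes  <- as.numeric(d$grp)
values <- as.numeric(as.character(d$grp))

str(d)
```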

Since you're new to R I'd suggest you have a look at these two websites:

R reference manuals: http://cran.r-project.org/manuals.html

R Reference card: http://cran.r-project.org/doc/contrib/Short-refcard.pdf

Tyler Rinker
  • Thanks! But I have another problem. I have the name of each column in an array of characters col_names[]. How can I use the above command? (Neither `mydf$col_names[i]` nor `mydf[,col_names[i]]` works.) – Rasoul Feb 12 '12 at 18:41
  • 1
    @Rasoul, `mydf[, col_names]` will do this – DrDom Feb 12 '12 at 18:49
  • 4
    +1 for the refs. This is basic stuff, which is OK to ask, but it's also fine to be aware of the extensive work that has been put into these (and similar) works. – Roman Luštrik Feb 12 '12 at 20:25
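For the follow-up question about column names stored in a character vector, one possible sketch is below. The `$` operator takes the name after it literally rather than evaluating it, so `[[` is used instead. The data frame contents are invented for illustration.

```r
# Illustrative data; only the names mydf and col_names come from the question.
mydf <- data.frame(a = c("x", "y", "x"),
                   b = c("u", "u", "v"),
                   n = 1:3,
                   stringsAsFactors = FALSE)
col_names <- c("a", "b")

# `[[` evaluates its argument, so a name held in a variable works here.
for (nm in col_names) {
  mydf[[nm]] <- as.factor(mydf[[nm]])
}

sapply(mydf, class)
```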
101
# To do it for all names
df[] <- lapply( df, factor) # the "[]" keeps the dataframe structure

# to do it for some names in a vector named 'col_names'
col_names <- names(df) # all names here; substitute the subset you want
df[col_names] <- lapply(df[col_names] , factor)

Explanation: all data frames are lists, and the result of [ used with a multi-valued argument is likewise a list, so looping over the columns is a job for lapply. The assignment above produces a list of factors, which the function [<-.data.frame then puts back into the data frame df.

Another strategy would be to convert only those columns where the number of unique items is less than some criterion, for example fewer than the base-10 log of the number of rows:
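A small sketch of why the empty brackets matter, using invented data: assigning the result of lapply to a new name gives a plain list, while assigning into df[] fills the existing columns and keeps the data.frame class.

```r
# Illustrative data; the column names are assumptions.
df <- data.frame(x = c("a", "b", "a"),
                 y = c("p", "q", "q"),
                 stringsAsFactors = FALSE)

as_list <- lapply(df, factor)   # plain list of factors
class(as_list)                  # "list"

kept <- df
kept[] <- lapply(kept, factor)  # columns replaced in place
class(kept)                     # "data.frame"
```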

cols.to.factor <- sapply( df, function(col) length(unique(col)) < log10(length(col)) )
df[ cols.to.factor] <- lapply(df[ cols.to.factor] , factor)
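As a worked example of that criterion, here is a sketch on invented data with 1000 rows. The threshold is log10(1000) = 3, so only columns with fewer than 3 distinct values are converted. Whether such a threshold suits real data is a judgment call.

```r
# Illustrative data; the column names flag and id are assumptions.
n  <- 1000
df <- data.frame(flag = rep(c("yes", "no"), length.out = n),  # 2 distinct values
                 id   = as.character(seq_len(n)),             # 1000 distinct values
                 stringsAsFactors = FALSE)

threshold <- log10(nrow(df))  # 3 for 1000 rows

cols.to.factor <- sapply(df, function(col) length(unique(col)) < log10(length(col)))
df[cols.to.factor] <- lapply(df[cols.to.factor], factor)

sapply(df, class)  # flag becomes a factor, id stays character
```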
IRTFM
  • 1
    This is a very nice solution! It can also work with column numbers which might be especially useful if you wanted to change many but not all. E.g., col_nums <- c(1, 6, 7:9, 21:23, 27:28, 30:31, 39, 49:55, 57) then df[,col_nums] <- lapply(df[,col_nums] , factor). – WGray Aug 08 '14 at 17:17
  • Caveat: the first solution doesn't work if `length(col_names)==1`. In that case, `df[,col_names]` is automatically demoted to a vector instead of a list of length 1, and then `lapply` tries to operate over each entry rather than the column as a whole. This can be prevented by using `df[,col_names,drop=FALSE]`. – P Schnell Sep 11 '16 at 17:14
  • That's a good point. The other invocation that would retain the list status is to use `df[col_names]`. – IRTFM Sep 11 '16 at 17:53
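The single-column caveat from the comments can be checked with a short sketch on invented data: matrix-style indexing with one column drops to a vector, while drop = FALSE or list-style indexing keeps a one-column data frame that lapply treats as a single column.

```r
# Illustrative data; the column names are assumptions.
df <- data.frame(x = c("a", "b", "a"), n = 1:3, stringsAsFactors = FALSE)
col_names <- "x"

is.list(df[, col_names])                # FALSE: dropped to a character vector
is.list(df[, col_names, drop = FALSE])  # TRUE: still a data frame
is.list(df[col_names])                  # TRUE: list-style indexing never drops

# Safe for one or many columns:
df[col_names] <- lapply(df[col_names], factor)
```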
35

You could use dplyr::mutate_if() to convert all character columns or dplyr::mutate_at() for select named character columns to factors:

library(dplyr)

# all character columns to factor:
df <- mutate_if(df, is.character, as.factor)

# select character columns 'char1', 'char2', etc. to factor:
df <- mutate_at(df, vars(char1, char2), as.factor)
sbha
  • `mutate_at` is really fast when you have a lot of columns (~50000) and you only have to transform 3. – emr2 Sep 28 '22 at 08:45
18

If you want to change all character variables in your data.frame to factors after you've already loaded your data, you can do it like this, to a data.frame called dat:

character_vars <- sapply(dat, is.character)
dat[character_vars] <- lapply(dat[character_vars], as.factor)

This creates a vector identifying which columns are of class character, then applies as.factor to those columns.

Sample data:

dat <- data.frame(var1 = c("a", "b"),
                  var2 = c("hi", "low"),
                  var3 = c(0, 0.1),
                  stringsAsFactors = FALSE
                  )
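As a quick check, the approach can be run end to end on the sample data. This sketch uses list-style indexing (dat[character_vars]) so that it also behaves when only one column is of class character.

```r
dat <- data.frame(var1 = c("a", "b"),
                  var2 = c("hi", "low"),
                  var3 = c(0, 0.1),
                  stringsAsFactors = FALSE)

# Logical vector: TRUE for the character columns var1 and var2.
character_vars <- sapply(dat, is.character)
dat[character_vars] <- lapply(dat[character_vars], as.factor)

str(dat)  # var1 and var2 are now factors, var3 is still numeric
```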
Sam Firke
  • The complete conversion of every character variable to factor usually happens when reading in data, e.g., with `stringsAsFactors = TRUE`, but this is useful when say, you've read data in with `read_excel()` from the `readxl` package and want to train a random forest model that doesn't accept character variables. – Sam Firke Jan 07 '16 at 22:01
14

Another short way is the compound assignment pipe (%<>%) from the magrittr package. The line below converts the character column mycolumn to a factor.

library(magrittr)

mydf$mycolumn %<>% factor
chriad
  • Please edit with more information. Code-only and "try this" answers are discouraged, because they contain no searchable content, and don't explain why someone should "try this". We make an effort here to be a resource for knowledge. – Brian Tompsett - 汤莱恩 Jun 24 '16 at 11:13
  • Please, what if I want to use it for all columns of my df? – Mostafa90 Jan 26 '17 at 13:50
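For readers without magrittr, a base R sketch of what that line does, plus one way to cover every column as asked in the comment. The data and the second column name are invented for illustration.

```r
# Illustrative data; only the names mydf and mycolumn come from the answer.
mydf <- data.frame(mycolumn = c("a", "b", "a"),
                   other    = c("c", "d", "d"),
                   stringsAsFactors = FALSE)

# Base R equivalent of the compound assignment pipe for one column:
mydf$mycolumn <- factor(mydf$mycolumn)

# All columns at once; the empty brackets keep mydf a data frame:
mydf[] <- lapply(mydf, factor)

sapply(mydf, class)
```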
6

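I've been doing it with a loop. In this case I only transform character variables to factors:

# seq_len() is safe even when the data frame has zero columns
for (i in seq_len(ncol(data))) {
    # [[i]] extracts the column itself (this also works for tibbles)
    if (is.character(data[[i]])) {
        data[[i]] <- factor(data[[i]])
    }
}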
Edu Marín
  • I believe you need double brackets to actually extract the column and change it to a factor, e.g. `[[i]]` – RTrain3k Nov 13 '19 at 16:17
3

Unless you need to identify the columns automatically, I found this to be the simplest solution:

df$name <- as.factor(df$name)

This makes the column called name in the data frame df a factor.

Christian Lindig
2

You can use across(), available since dplyr 1.0.0:

library(dplyr)

df <- mtcars 
#To turn 1 column to factor
df <- df %>% mutate(cyl = factor(cyl))

#Turn columns to factor based on their type. 
df <- df %>% mutate(across(where(is.character), factor))

#Based on the position
df <- df %>% mutate(across(c(2, 4), factor))

#Change specific columns by their name
df <- df %>% mutate(across(c(cyl, am), factor))
Ronak Shah