133

I have a problem using data.table: how do I convert column classes? Here is a simple example. With a data.frame I have no problem converting them, but with a data.table I just don't know how:

df <- data.frame(ID=c(rep("A", 5), rep("B",5)), Quarter=c(1:5, 1:5), value=rnorm(10))
#One way: http://stackoverflow.com/questions/2851015/r-convert-data-frame-columns-from-factors-to-characters
df <- data.frame(lapply(df, as.character), stringsAsFactors=FALSE)
#Another way
df[, "value"] <- as.numeric(df[, "value"])

library(data.table)
dt <- data.table(ID=c(rep("A", 5), rep("B",5)), Quarter=c(1:5, 1:5), value=rnorm(10))
dt <- data.table(lapply(dt, as.character), stringsAsFactors=FALSE) 
#Error in rep("", ncol(xi)) : invalid 'times' argument
#Produces error, does data.table not have the option stringsAsFactors?
dt[, "ID", with=FALSE] <- as.character(dt[, "ID", with=FALSE]) 
#Produces error: Error in `[<-.data.table`(`*tmp*`, , "ID", with = FALSE, value = "c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2)") : 
#unused argument(s) (with = FALSE)

Am I missing something obvious here?

Update due to Matthew's post: I used an older version before, but even after updating to 1.6.6 (the version I use now) I still get an error.

Update 2: Let's say I want to convert every column of class "factor" to a "character" column, but don't know in advance which column is of which class. With a data.frame, I can do the following:

classes <- as.character(sapply(df, class))
colClasses <- which(classes=="factor")
df[, colClasses] <- sapply(df[, colClasses], as.character)

Can I do something similar with data.table?

Update 3:

sessionInfo()
R version 2.13.1 (2011-07-08)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.6.6

loaded via a namespace (and not attached):
[1] tools_2.13.1
Christoph_J
  • The "[" operator arguments in `data.table` methods are different than they are for `data.frame` – IRTFM Oct 18 '11 at 21:11
  • 1
    Please paste the actual error rather than `#Produces error`. +1 anyway. I don't get any error, which version do you have? There is an issue in this area though, it's been raised before, [FR#1224](https://r-forge.r-project.org/tracker/index.php?func=detail&aid=1224&group_id=240&atid=978) and [FR#1493](https://r-forge.r-project.org/tracker/index.php?func=detail&aid=1493&group_id=240&atid=978) are high priority to address. Andrie's answer is the best way, though. – Matt Dowle Oct 19 '11 at 09:43
  • Sorry @MatthewDowle for missing that in my question, I updated my post. – Christoph_J Oct 19 '11 at 12:50
  • 1
    @Christoph_J Thanks. Are you sure about that `invalid times argument` error? Works fine for me. Which version do you have? – Matt Dowle Oct 19 '11 at 15:24
  • I updated my post with the sessionInfo(). However, I checked it on my work machine today. Yesterday, on my home machine (Ubuntu) the same error occurred. I will update R and see if the problem is still there. – Christoph_J Oct 19 '11 at 16:31
  • Update: Problem still there with R 2.13.2, I will update the data.table package at home on my Ubuntu machine and see what happens then. – Christoph_J Oct 19 '11 at 16:44

10 Answers

123

For a single column:

dtnew <- dt[, Quarter:=as.character(Quarter)]
str(dtnew)

Classes ‘data.table’ and 'data.frame':  10 obs. of  3 variables:
 $ ID     : Factor w/ 2 levels "A","B": 1 1 1 1 1 2 2 2 2 2
 $ Quarter: chr  "1" "2" "3" "4" ...
 $ value  : num  -0.838 0.146 -1.059 -1.197 0.282 ...

Using lapply and as.character:

dtnew <- dt[, lapply(.SD, as.character), by=ID]
str(dtnew)

Classes ‘data.table’ and 'data.frame':  10 obs. of  3 variables:
 $ ID     : Factor w/ 2 levels "A","B": 1 1 1 1 1 2 2 2 2 2
 $ Quarter: chr  "1" "2" "3" "4" ...
 $ value  : chr  "1.487145280568" "-0.827845218358881" "0.028977182770002" "1.35392750102305" ...
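
Several comments ask why `by=ID` is used and how to handle only some columns. The following is a minimal sketch, assuming a reasonably recent data.table version, that converts all columns without grouping and then converts a chosen subset by reference. The names `dt_chr` and `cols` are illustrative and not from the original answer.

```r
library(data.table)

# Same shape of data as in the question (values are random)
dt <- data.table(ID = rep(c("A", "B"), each = 5),
                 Quarter = c(1:5, 1:5),
                 value = rnorm(10))

# Convert every column without grouping; returns a new data.table
# and keeps the original column order
dt_chr <- dt[, lapply(.SD, as.character)]

# Convert only selected columns in place (by reference);
# the parentheses make data.table treat cols as a vector of names
cols <- c("Quarter", "value")
dt[, (cols) := lapply(.SD, as.character), .SDcols = cols]

sapply(dt, class)
```

Dropping `by=` avoids the grouping column being moved to the front, which is the reordering one of the comments objects to.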
Andrie
  • Thanks @Andrie for the answer, that definitely works for this particular problem, but it seems to me like a workaround. I updated my original question since I do not know in advance which columns I want to convert. With a data.frame, I can write just three lines to convert, for example, between factors and characters for every column in the data.frame (see updated example above). Can I do something similar with data.table? Or do I have to check first which column is not of one type (the type I want to convert) and group by this column? – Christoph_J Oct 19 '11 at 13:04
  • 2
    @Christoph_J Please show the grouping command you're struggling with (the real problem). Think you may have missed something simple. Why are you trying to convert column classes? – Matt Dowle Oct 19 '11 at 15:27
  • 2
    @Christoph_J If you struggle to manipulate data.tables, why not simply convert them temporarily to data.frames, do the data cleaning and then convert them back to data.tables? – Andrie Oct 19 '11 at 16:10
  • @Andrie I suspect it's more that his upgrade didn't install properly. You know on Windows where the dll is locked by another R session or something like that. The error messages look like old ones so maybe the old version is still knocking around somehow. – Matt Dowle Oct 19 '11 at 16:31
  • @MatthewDowle: I'm fairly new at stackoverflow, so what is the best way to carry on? Should I select Andrie's answer as the best and post a new question on stackoverflow with the real problem and cross-link from here or should I update my question again (this will be a substantial change though)? Thanks! – Christoph_J Oct 19 '11 at 16:47
  • 1
    @Christoph_J If your "real" question is substantially different, then I suggest you post it as a new question. In that way you will get new eyeballs on the question. – Andrie Oct 19 '11 at 16:55
  • @Christoph_J Yep, just accept Andrie's answer and move on. If you upvote some of the comments that would be nice. I still think you might have an installation problem. I see your sessionInfo() output, but strange things can happen with namespaces and dlls after upgrades. Please close all R sessions and start again with a fresh session. – Matt Dowle Oct 19 '11 at 17:48
  • Thanks both of you, I will continue as proposed. @MatthewDowle: I just checked it out on my home machine (Ubuntu) with data.table 1.6.6 and I have the same problem there. Really strange... – Christoph_J Oct 19 '11 at 18:26
  • @Christoph_J I am on Ubuntu too testing with 1.6.6. I start R 2.13.2, I type `library(data.table)` then `dt <- data.table(ID=c(rep("A", 5), rep("B",5)), Quarter=c(1:5, 1:5), value=rnorm(10))` and it works fine. What happens for you? – Matt Dowle Oct 19 '11 at 19:21
  • @MatthewDowle Sorry for the misunderstanding, the error does not happen at this line, but rather here: `dt <- data.table(lapply(dt, as.character), stringsAsFactors=FALSE)`, so one line below: error as above: `Error in rep("", ncol(xi)) : invalid 'times' argument` – Christoph_J Oct 19 '11 at 21:17
  • OK, I posted my basic problem [here](http://stackoverflow.com/questions/7828428/what-are-the-restrictions-for-the-column-classes-in-data-table). Took me a while to break it down ;-) – Christoph_J Oct 19 '11 at 21:33
  • 19
    What is the idiomatic way of doing this for a subset of columns (instead of all of them)? I've defined a character vector `convcols` of columns. `dt[,lapply(.SD,as.numeric),.SDcols=convcols]` is almost instant while `dt[,convcols:=lapply(.SD,as.numeric),.SDcols=convcols]` almost freezes up R, so I'm guessing that I'm doing it wrong. Thanks – Frank May 02 '13 at 23:07
  • 5
    @Frank See Matt Dowle's comment to Geneorama's answer below (http://stackoverflow.com/questions/7813578/convert-column-classes-in-data-table?rq=1#comment31200110_20808945); it was helpful and idiomatic enough for me [start quote] Another and easier way is to use `set()` e.g. `for (col in names_factors) set(dt, j=col, value=as.factor(dt[[col]]))` [end quote] – swihart Nov 05 '14 at 18:49
  • Hello @Andrie What if I want to convert all the columns containing dates to character? – skan Jul 14 '16 at 16:36
  • Hello. What if I only want to change the class of some columns? For example, I've previously saved a vector of FALSE and TRUE specifying which columns to change to "Date", such as c(T,T,F,F,F,F, T)? – skan Jul 14 '16 at 23:17
  • Could we use setattr? – skan Mar 06 '17 at 16:37
  • 5
    Why do you use the by=ID option? – skan Mar 20 '17 at 12:19
  • I don't like this for multiple columns because, in general, `by=` changes the order of the columns. – James Hirschorn Mar 11 '18 at 01:41
62

Try this

DT <- data.table(X1 = c("a", "b"), X2 = c(1,2), X3 = c("hello", "you"))
changeCols <- colnames(DT)[which(as.vector(DT[,lapply(.SD, class)]) == "character")]

DT[,(changeCols):= lapply(.SD, as.factor), .SDcols = changeCols]
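
The same pattern also covers the question's second update (factor columns to character, without knowing the classes in advance). This is a sketch under the assumption that `is.factor()` is the right test for your data; `factorCols` is an illustrative name.

```r
library(data.table)

DT <- data.table(X1 = factor(c("a", "b")),
                 X2 = c(1, 2),
                 X3 = factor(c("hello", "you")))

# is.factor() yields exactly one TRUE/FALSE per column,
# even when a column carries more than one class
factorCols <- names(DT)[vapply(DT, is.factor, logical(1))]

# Guard against the case where no column is a factor
if (length(factorCols) > 0L) {
  DT[, (factorCols) := lapply(.SD, as.character), .SDcols = factorCols]
}

sapply(DT, class)
```

Testing with `is.factor()` rather than comparing `class()` to a string sidesteps columns such as ordered factors or date-times, whose class vector has length greater than one.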
Nera
12

Raising Matt Dowle's comment to Geneorama's answer (https://stackoverflow.com/a/20808945/4241780) to make it more obvious (as encouraged), you can use for(...)set(...).


library(data.table)

DT = data.table(a = LETTERS[c(3L,1:3)], b = 4:7, c = letters[1:4])
DT1 <- copy(DT)
names_factors <- c("a", "c")

for(col in names_factors)
  set(DT, j = col, value = as.factor(DT[[col]]))

sapply(DT, class)
#>         a         b         c 
#>  "factor" "integer"  "factor"

Created on 2020-02-12 by the reprex package (v0.3.0)

See another of Matt's comments at https://stackoverflow.com/a/33000778/4241780 for more info.

Edit.

As noted by Espen and in help(set), j may be "Column name(s) (character) or number(s) (integer) to be assigned value when column(s) already exist". So names_factors <- c(1L, 3L) will also work.
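
As a small illustration of the integer form, here is the same loop driven by column positions. It is only a sketch; `idx_factors` is a hypothetical name introduced for this example.

```r
library(data.table)

DT <- data.table(a = LETTERS[c(3L, 1:3)], b = 4:7, c = letters[1:4])

# Positions of columns "a" and "c"
idx_factors <- c(1L, 3L)

# set() accepts an integer j, and DT[[j]] extracts the column by position
for (j in idx_factors)
  set(DT, j = j, value = as.factor(DT[[j]]))

sapply(DT, class)
```

Positions are convenient for "all columns" (`seq_len(ncol(DT))`), but names are more robust if the column order may change.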

JWilliman
  • You might want to add what `names_factors` is here. I guess it's taken from https://stackoverflow.com/a/20808945/1666063, so it's `names_factors = c('fac1', 'fac2')` in this case, which is column names. But it could also be column numbers, for example 1:ncol(dt), which would convert all columns. – Espen Riskedal Jan 14 '20 at 19:34
  • @EspenRiskedal Thanks good point, I've edited the post to make it more obvious. – JWilliman Feb 11 '20 at 21:55
2

This is a BAD way to do it! I'm only leaving this answer in case it solves other weird problems. The better methods are probably partly the result of newer data.table versions, so it's worthwhile to document this hard way. Plus, this is a nice example of the eval/substitute syntax.

library(data.table)
dt <- data.table(ID = c(rep("A", 5), rep("B",5)), 
                 fac1 = c(1:5, 1:5), 
                 fac2 = c(1:5, 1:5) * 2, 
                 val1 = rnorm(10),
                 val2 = rnorm(10))

names_factors = c('fac1', 'fac2')
names_values = c('val1', 'val2')

for (col in names_factors){
  e = substitute(X := as.factor(X), list(X = as.symbol(col)))
  dt[ , eval(e)]
}
for (col in names_values){
  e = substitute(X := as.numeric(X), list(X = as.symbol(col)))
  dt[ , eval(e)]
}

str(dt)

which gives you

Classes ‘data.table’ and 'data.frame':  10 obs. of  5 variables:
 $ ID  : chr  "A" "A" "A" "A" ...
 $ fac1: Factor w/ 5 levels "1","2","3","4",..: 1 2 3 4 5 1 2 3 4 5
 $ fac2: Factor w/ 5 levels "2","4","6","8",..: 1 2 3 4 5 1 2 3 4 5
 $ val1: num  0.0459 2.0113 0.5186 -0.8348 -0.2185 ...
 $ val2: num  -0.0688 0.6544 0.267 -0.1322 -0.4893 ...
 - attr(*, ".internal.selfref")=<externalptr> 
geneorama
2

If you have a list of names of data.table columns whose class you want to change, do:

convert_to_character <- c("Quarter", "value")

dt[, convert_to_character] <- dt[, lapply(.SD, as.character), .SDcols = convert_to_character]
Emil Lykke Jensen
  • 3
    This answer is essentially a bad version of @Nera's answer below. Just do `dt[, c(convert_to_character) := lapply(.SD, as.character), .SDcols=convert_to_character]` to assign by reference, rather than using the slower data.frame assignment. – altabq Jan 25 '19 at 10:27
0

I tried several approaches.

# Using {dplyr}; the %<>% and %>% operators come from {magrittr}
library(data.table)
library(magrittr)
data.table(ID      = c(rep("A", 5), rep("B",5)), 
           Quarter = c(1:5, 1:5), 
           value   = rnorm(10)) -> df1
df1 %<>% dplyr::mutate(ID      = as.factor(ID),
                       Quarter = as.character(Quarter))
# check classes
dplyr::glimpse(df1)
# Observations: 10
# Variables: 3
# $ ID      (fctr) A, A, A, A, A, B, B, B, B, B
# $ Quarter (chr) "1", "2", "3", "4", "5", "1", "2", "3", "4", "5"
# $ value   (dbl) -0.07676732, 0.25376110, 2.47192852, 0.84929175, -0.13567312,  -0.94224435, 0.80213218, -0.89652819...

Or otherwise:

# from list to data.table using data.table::setDT
list(ID      = as.factor(c(rep("A", 5), rep("B",5))), 
     Quarter = as.character(c(1:5, 1:5)), 
     value   = rnorm(10)) %>% setDT() -> df2  # setDT() takes the piped list itself
class(df2)
# [1] "data.table" "data.frame"
uribo
0

Here is a more general and safer way to do this:

".." <- function (x) 
{
  stopifnot(inherits(x, "character"))
  stopifnot(length(x) == 1)
  get(x, parent.frame(4))
}


set_colclass <- function(x, class){
  stopifnot(all(class %in% c("integer", "numeric", "double","factor","character")))
  for(i in intersect(names(class), names(x))){
    f <- get(paste0("as.", class[i]))
    x[, (..("i")):=..("f")(get(..("i")))]
  }
  invisible(x)
}

The function .. makes sure we get a variable out of the scope of data.table; set_colclass will set the classes of your cols. You can use it like this:

dt <- data.table(i=1:3,f=3:1)
set_colclass(dt, c(i="character"))
class(dt$i)
liqg3
0

This takes the same approach as @Nera's answer, checking the class first, but instead of .SD it uses data.table's fast set() loop, as in @Matt Dowle's solution, with a class check added.

for (j in seq_len(ncol(DT))){
  if (is.factor(DT[[j]]))  # is.factor() is safe when a column has more than one class
    set(DT, j = j, value = as.character(DT[[j]]))
}
yuskam
0
columnID = c(1,2) # or
columnID = c('column1','column2')


for(i in columnID) class(dt[[i]]) <- 'character'

The for loop changes each column vector's class attribute to character. It effectively treats the data.table as a list.

asepsiswu
  • Remember that Stack Overflow isn't just intended to solve the immediate problem, but also to help future readers find solutions to similar problems, which requires understanding the underlying code. This is especially important for members of our community who are beginners, and not familiar with the syntax. Given that, **can you [edit] your answer to include an explanation of what you're doing** and why you believe it is the best approach? – Jeremy Caney Mar 08 '23 at 00:08
-2

Try:

dt <- data.table(A = c(1:5), 
                 B= c(11:15))

x <- ncol(dt)

for(i in 1:x) 
{
     dt[[i]] <- as.character(dt[[i]])
}
Jason
user151444