42

I seem to spend a lot of time creating a data frame from a file, a database or some other source, and then converting each column into the type I want (numeric, factor, character, etc.). Is there a way to do this in one step, possibly by giving a vector of types?

foo<-data.frame(x=c(1:10), 
                y=c("red", "red", "red", "blue", "blue", 
                    "blue", "yellow", "yellow", "yellow", 
                    "green"),
                z=Sys.Date()+c(1:10))

foo$x<-as.character(foo$x)
foo$y<-as.character(foo$y)
foo$z<-as.numeric(foo$z)

Instead of the last three commands, I'd like to do something like:

foo<-convert.magic(foo, c(character, character, numeric))
PaulHurleyuk
  • 8
    Use the `colClasses` argument to `read.table`. – Joshua Ulrich Oct 06 '11 at 21:58
  • Ranges of values can also be assigned simply using: `for(n in names(foo)[1:2]){foo[[n]]<-as.character(foo[[n]])}` Convenient for lots of columns to convert. – RichT Apr 08 '15 at 23:11
  • Learned if converting multiple fields from factor to numeric you will need another call to `as.character` or `levels`. see: http://stackoverflow.com/questions/3418128/how-to-convert-a-factor-to-an-integer-numeric-without-a-loss-of-information – RichT Apr 22 '15 at 20:05

11 Answers

35

Edit: See this related question for some simplifications and extensions on this basic idea.

My comment to Brandon's answer using switch:

convert.magic <- function(obj,types){
    for (i in 1:length(obj)){
        FUN <- switch(types[i],character = as.character, 
                                   numeric = as.numeric, 
                                   factor = as.factor)
        obj[,i] <- FUN(obj[,i])
    }
    obj
}

out <- convert.magic(foo,c('character','character','numeric'))
> str(out)
'data.frame':   10 obs. of  3 variables:
 $ x: chr  "1" "2" "3" "4" ...
 $ y: chr  "red" "red" "red" "blue" ...
 $ z: num  15254 15255 15256 15257 15258 ...

For truly large data frames you may want to use lapply instead of the for loop:

convert.magic1 <- function(obj,types){
    out <- lapply(1:length(obj),FUN = function(i){
        FUN1 <- switch(types[i],character = as.character,numeric = as.numeric,factor = as.factor)
        FUN1(obj[,i])
    })
    names(out) <- colnames(obj)
    as.data.frame(out,stringsAsFactors = FALSE)
}

When doing this, be aware of some of the intricacies of coercing data in R. For example, converting from factor to numeric usually requires as.numeric(as.character(...)), because as.numeric() on a factor returns the internal integer codes rather than the labels. Also, be aware that before R 4.0.0, data.frame() and as.data.frame() converted character columns to factor by default (stringsAsFactors = TRUE).
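
The factor-to-numeric pitfall mentioned above can be seen in a minimal sketch like the following. The vector `f` and the variable names are made up for illustration; only base R is used.

```r
# A factor whose labels look like numbers (illustrative data)
f <- factor(c("3.6", "10", "2.5", "10"))

# as.numeric() on a factor returns the internal integer codes,
# which follow the sorted order of the levels, not the labels
codes <- as.numeric(f)

# Going through character first recovers the values the labels represent
values <- as.numeric(as.character(f))

# Equivalent alternative: convert the levels once, then index by the factor
values2 <- as.numeric(levels(f))[f]

print(codes)
print(values)
```

Comparing `codes` with `values` shows why a plain as.numeric switch branch is not enough when the input column is a factor.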

joran
  • +1 for posterity, although I don't understand what the difference is. – Brandon Bertelsen Oct 07 '11 at 00:21
  • @BrandonBertelsen If you're referring to the use of `switch`, for me it's mostly aesthetic and makes the code easier to read. Not sure if there's any performance difference, although I suspect not. – joran Oct 07 '11 at 00:25
  • 2
    +1 for recommending `lapply`. I've struggled to optimise this type of problem in the past, and it turns out that the `[<-` operation is rather slow. – Andrie Oct 07 '11 at 09:24
  • @Andrie How time flies. Now that `data.table` has `:=` that's the way to go for this question, should be much much faster. `[<-` was slow because it was copying the whole data.frame every time one column changed. If I add an answer, will need votes and then ask OP to change accept? – Matt Dowle Jun 29 '12 at 17:22
  • 1
    Does this function convert numeric factors to numeric (i.e 3.6 = 3.6, not the factor order number)? How to incorporate that into the function? I tried as.numeric(as.character), which does not work. – Mikko Jan 22 '13 at 10:52
  • 1
    @MatthewDowle: mind posting the data.table solution? Haven't done too much with it yet, so this isn't necessarily a no-brainer for me. Sounds interesting, though. – Matt Bannert Jun 03 '13 at 14:11
  • 1
    @MattBannert Hi. A looped `set` as in the last edit in [this answer](http://stackoverflow.com/a/16846530/403310) is the way I'd do this. Replace the `-` with a call to `as(...)`, or similar. – Matt Dowle Jun 03 '13 at 14:52
  • Is there a way to use the actual column names instead of partnering the type to the index? –  Feb 03 '14 at 16:50
  • I keep getting the error `Error in FUN(1:11[[11L]], ...) : could not find function "FUN1"`. Any thoughts of what I am doing wrong? – Kevin Apr 18 '14 at 23:31
  • @Mikko: did you ever figure out how to handle the as.numeric(as.character) issue? – theforestecologist Dec 05 '15 at 18:19
  • @theforestecologist Yes. I added it as an answer to this thread. – Mikko Dec 07 '15 at 08:26
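
One of the comments above asks whether types can be paired with column names instead of positions. A minimal sketch of that idea follows, assuming the types are passed as a named character vector; `convert.by.name` is a hypothetical helper written for illustration, not part of any answer or package.

```r
# Hypothetical helper: `types` is a named character vector that maps
# column names to target types; columns not named are left untouched.
convert.by.name <- function(obj, types) {
    for (nm in names(types)) {
        FUN <- switch(types[[nm]],
                      character = as.character,
                      numeric   = as.numeric,
                      factor    = as.factor,
                      stop("unsupported type: ", types[[nm]]))
        obj[[nm]] <- FUN(obj[[nm]])
    }
    obj
}

# Small illustrative data frame
foo <- data.frame(x = 1:3,
                  y = c("red", "blue", "green"),
                  z = Sys.Date() + 1:3)

# Order of the names does not matter, and y is not mentioned at all
out <- convert.by.name(foo, c(z = "numeric", x = "character"))
str(out)
```

The trailing unnamed argument to switch acts as a default, so an unknown type name fails with a clear error instead of a missing-function error.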
22

If you want to detect each column's data type automatically rather than specify it manually (e.g. after data tidying), the function type.convert() may help.

The function type.convert() takes in a character vector and attempts to determine the optimal type for all elements (meaning that it has to be applied once per column).

df[] <- lapply(df, function(x) type.convert(as.character(x), as.is = TRUE))

Since I love dplyr, I prefer:

library(dplyr)
df <- df %>% mutate(across(everything(), ~ type.convert(as.character(.x), as.is = TRUE)))
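
As a rough illustration of what type.convert() detects, here is a small base R sketch. The data frame and its column names are invented for the example; as.is = TRUE is passed so that text columns stay character rather than becoming factors.

```r
# Every column starts out as character, e.g. after reading raw text
df <- data.frame(id    = c("1", "2", "3"),
                 score = c("2.5", "3.0", "4.25"),
                 name  = c("a", "b", "c"),
                 flag  = c("TRUE", "FALSE", NA),
                 stringsAsFactors = FALSE)

# Apply type.convert() column by column; df[] keeps the data frame structure
df[] <- lapply(df, type.convert, as.is = TRUE)

# Inspect the detected classes
print(sapply(df, class))
```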
Sam
Luke Hankins
  • Your first option should probably be `df[] <- lapply(df, function(x) type.convert(as.character(x))`. I'd drop the `apply` option since it's usually intended to produce a matrix or array. You've misspelled `dplyr` in your third option. Finally, this isn't really an answer to the OP's question, but rather to a related question. – Nick Kennedy Jul 13 '15 at 08:50
  • 2
    Thanks for the formatting tips. It took me a long time to find a feature like type.convert, so I thought placing it here on a similar issue that came up more frequently would help someone like me down the road. – Luke Hankins Jul 14 '15 at 05:32
  • fair enough, though worth perhaps looking at [this question](http://stackoverflow.com/questions/28254971/re-convert-data-types-in-r) – Nick Kennedy Jul 14 '15 at 15:23
  • Alternative (less verbose and more informative) approach using ``readr`` package: ``df <- type_convert(df)``. – runr Apr 07 '22 at 14:29
8

I find I run into this a lot as well. This is about how you import data. All of the read...() functions have some type of option to specify not converting character strings to a factor. Meaning that text strings will stay character and things that look like numbers will stay as numbers. A problem arises when you have elements that are empty and not NA. But again, na.strings = c("",...) should solve that as well. I'd start by taking a hard look at your import process and adjusting it accordingly.
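
As a sketch of fixing the types at import time, the following uses read.csv with the colClasses and na.strings arguments. The inline CSV text stands in for a real file, and its contents are made up for illustration.

```r
# Inline CSV text stands in for a real file (illustrative data)
csv <- "x,y,z
1,red,2011-10-07
2,,2011-10-08
3,blue,2011-10-09"

# colClasses gives one type per column, in column order;
# na.strings = "" turns the empty field in y into NA
foo <- read.csv(text = csv,
                colClasses = c("character", "character", "Date"),
                na.strings = "")

str(foo)
```

With the types declared up front, no conversion step is needed after the data frame is created.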

But you could always create a function and push this string through.

convert.magic <- function(x, y) {
  # y is a character vector of target types, one entry per column of x
  for (i in seq_along(y)) {
    if (y[i] == "numeric") {
      x[[i]] <- as.numeric(x[[i]])
    }
    if (y[i] == "character") {
      x[[i]] <- as.character(x[[i]])
    }
  }
  return(x)
}

foo <- convert.magic(foo, c("character", "character", "numeric"))

> str(foo)
'data.frame':   10 obs. of  3 variables:
 $ x: chr  "1" "2" "3" "4" ...
 $ y: chr  "red" "red" "red" "blue" ...
 $ z: num  15254 15255 15256 15257 15258 ...
Brandon Bertelsen
  • 1
    Try replacing the `if` statements with a call to `switch`, which can actually return the appropriate function: `switch(expr,character = as.character, numeric = as.numeric,...)`. – joran Oct 06 '11 at 22:28
  • meh, write it as an answer so you can get bonus points :) I just whipped something up quick. – Brandon Bertelsen Oct 06 '11 at 22:29
7

I know I am quite late to answer, but using a loop along with the attributes function is a simple solution to your problem.

names <- c("x", "y", "z")
chclass <- c("character", "character", "numeric")

for (i in (1:length(names))) {
  attributes(foo[, names[i]])$class <- chclass[i]
}
jay.sf
SeaJane
2

I just ran into something like this with the RSQLite fetch method: the results come back as atomic data types. In my case, it was a date-time stamp that was causing me frustration. I found that the setAs function is very useful for making as() work as expected. Here is my small example case.

##data.frame conversion function
convert.magic2 <- function(df,classes){
  out <- lapply(1:length(classes),
                FUN = function(classIndex){as(df[,classIndex],classes[classIndex])})
  names(out) <- colnames(df)
  return(data.frame(out))
}

##small example case
tmp.df <- data.frame('dt'=c("2013-09-02 09:35:06", "2013-09-02 09:38:24", "2013-09-02 09:38:42", "2013-09-02 09:38:42"),
                     'v'=c('1','2','3','4'),
                     stringsAsFactors=FALSE)
classes=c('POSIXct','numeric')
str(tmp.df)
#confirm that it has character datatype columns
##  'data.frame':  4 obs. of  2 variables:
##    $ dt: chr  "2013-09-02 09:35:06" "2013-09-02 09:38:24" "2013-09-02 09:38:42" "2013-09-02 09:38:42"
##    $ v : chr  "1" "2" "3" "4"

##is the dt column coerceable to POSIXct?
canCoerce(tmp.df$dt,"POSIXct")
##  [1] FALSE

##and the convert.magic2 function fails also:
tmp.df.n <- convert.magic2(tmp.df,classes)

##  Error in as(df[, classIndex], classes[classIndex]) : 
##    no method or default for coercing “character” to “POSIXct” 

##a little reading reveals the setAs function
setAs('character', 'POSIXct', function(from){return(as.POSIXct(from))})

##better answer for canCoerce
canCoerce(tmp.df$dt,"POSIXct")
##  [1] TRUE

##better answer from convert.magic2
tmp.df.n <- convert.magic2(tmp.df,classes)

##column datatypes converted as I would like them!
str(tmp.df.n)

##  'data.frame':  4 obs. of  2 variables:
##    $ dt: POSIXct, format: "2013-09-02 09:35:06" "2013-09-02 09:38:24" "2013-09-02 09:38:42" "2013-09-02 09:38:42"
##    $ v : num  1 2 3 4
Osunderdog
2

Similar to type.convert(foo, as.is = TRUE), there is also readr::type_convert, which re-parses the character columns of a data frame and converts them to appropriate classes without your having to specify them:

readr::type_convert(foo) 

If all columns are stored as character, we could also use readr::parse_guess, which guesses the type of a single character vector and converts it accordingly. Consider this modified data frame:

foo <- data.frame(x = as.character(1:10), 
                  y = c("red", "red", "red", "blue", "blue", "blue", "yellow", 
                     "yellow", "yellow", "green"),
                  z = as.character(Sys.Date()+c(1:10)), stringsAsFactors = FALSE)

str(foo)

#'data.frame':  10 obs. of  3 variables:
# $ x: chr  "1" "2" "3" "4" ...
# $ y: chr  "red" "red" "red" "blue" ...
# $ z: chr  "2019-08-12" "2019-08-13" "2019-08-14" "2019-08-15" ...

Applying parse_guess on each column

foo[] <- lapply(foo, readr::parse_guess)

#'data.frame':  10 obs. of  3 variables:
# $ x: num  1 2 3 4 5 6 7 8 9 10
# $ y: chr  "red" "red" "red" "blue" ...
# $ z: Date, format: "2019-08-12" "2019-08-13" "2019-08-14" "2019-08-15" ...
Ronak Shah
1

An addition to @joran's answer, in which convert.magic does not preserve numeric values in a factor-to-numeric conversion:

convert.magic <- function(obj,types){
    out <- lapply(1:length(obj),FUN = function(i){FUN1 <- switch(types[i],
    character = as.character,numeric = as.numeric,factor = as.factor); FUN1(obj[,i])})
    names(out) <- colnames(obj)
    as.data.frame(out,stringsAsFactors = FALSE)
}

foo<-data.frame(x=c(1:10), 
                    y=c("red", "red", "red", "blue", "blue", 
                        "blue", "yellow", "yellow", "yellow", 
                        "green"),
                    z=Sys.Date()+c(1:10))

foo$x<-as.character(foo$x)
foo$y<-as.character(foo$y)
foo$z<-as.numeric(foo$z)

str(foo)
# 'data.frame': 10 obs. of  3 variables:
# $ x: chr  "1" "2" "3" "4" ...
# $ y: chr  "red" "red" "red" "blue" ...
# $ z: num  16777 16778 16779 16780 16781 ...

foo.factors <- convert.magic(foo, rep("factor", 3))

str(foo.factors) # all factors

foo.numeric.not.preserved <- convert.magic(foo.factors, c("numeric", "character", "numeric"))

str(foo.numeric.not.preserved)
# 'data.frame': 10 obs. of  3 variables:
# $ x: num  1 3 4 5 6 7 8 9 10 2
# $ y: chr  "red" "red" "red" "blue" ...
# $ z: num  1 2 3 4 5 6 7 8 9 10

# z comes out as 1 2 3...

The following should preserve the numeric values:

## as.numeric function that preserves numeric values when converting factor to numeric

as.numeric.mod <- function(x) {
    if(is.factor(x))
        as.numeric(levels(x))[x]
    else
        as.numeric(x)
}

## The same as in @joran's answer, except for as.numeric.mod
convert.magic <- function(obj,types){
    out <- lapply(1:length(obj),FUN = function(i){FUN1 <- switch(types[i],
    character = as.character,numeric = as.numeric.mod, factor = as.factor); FUN1(obj[,i])})
    names(out) <- colnames(obj)
    as.data.frame(out,stringsAsFactors = FALSE)
}

foo.numeric <- convert.magic(foo.factors, c("numeric", "character", "numeric"))

str(foo.numeric)
# 'data.frame': 10 obs. of  3 variables:
# $ x: num  1 2 3 4 5 6 7 8 9 10
# $ y: chr  "red" "red" "red" "blue" ...
# $ z: num  16777 16778 16779 16780 16781 ...

# z comes out with the correct numeric values
Mikko
1

A somewhat simple data.table solution, though it will take a few steps if you are changing to a lot of different column types.

library(data.table)
dt <- data.table(x=c(1:10), y=c(11:20), z=c(11:20), name=letters[1:10])

dt <- dt[, lapply(.SD, as.numeric), by= name]

This will change all the columns except those specified in by to numeric (or to whatever type you set in lapply).

moman822
1

There is a simple solution in the package hablar

Code

library(hablar)
library(dplyr)
df <- data.frame(x = "1", y = "2", z = "4")

df %>% 
  convert(int(x, z),
          chr(y))

Result

# A tibble: 1 x 3
      x y         z
  <int> <chr> <int>
1     1 2         4

You can simply list multiple column names to convert multiple columns, e.g. x and z to integer as in the example above.

davsjob
0

transform() does what you seem to describe:

foo <- transform(foo, x=as.character(x), y=as.character(y), z=as.numeric(z))
pogibas
leo277
0

Using purrr and base:

foo<-data.frame(x=c(1:10), 
                y=c("red", "red", "red", "blue", "blue", 
                    "blue", "yellow", "yellow", "yellow", 
                    "green"),
                z=Sys.Date()+c(1:10))
types <- c("character", "character", "numeric")
types<-paste0("as.",types)
purrr::map2_df(foo,types,function(x,y) do.call(y,list(x)))
# A tibble: 10 x 3
   x     y          z
   <chr> <chr>  <dbl>
 1 1     red    18127
 2 2     red    18128
 3 3     red    18129
 4 4     blue   18130
NelsonGon