6

I imported an Excel file and got a data frame like this

structure(list(A = structure(1:3, .Label = c("1.100", "2.300", 
"5.400"), class = "factor"), B = structure(c(3L, 2L, 1L), .Label = c("1.000.000", 
"500", "7.800"), class = "factor"), C = structure(1:3, .Label = c("200", 
"3.100", "4.500"), class = "factor")), .Names = c("A", "B", "C"
), row.names = c(NA, -3L), class = "data.frame")

I would now like to convert these chars to numeric or even integer. However, the dot character (.) is not a decimal sign but a "thousand's separator" (it's German).

How would I convert the data frame properly?

I tried this:

df2 <- as.data.frame(apply(df1, 2, gsub, pattern = "([0-9])\\.([0-9])", replacement= "\\1\\2"))

df3 <- as.data.frame(data.matrix(df2))

however, apply seems to convert each column to a list of factors. Can I maybe prevent apply from doing so?

BenMorel
  • 34,448
  • 50
  • 182
  • 322
speendo
  • 13,045
  • 22
  • 71
  • 107
  • If the problem had been that the values contained currency, that question has also been addressed at the level of data input using the `read.*` functions: http://stackoverflow.com/questions/10823241/how-can-i-completely-remove-scientific-notation-for-the-entire-r-session/10823641#10823641 – IRTFM Apr 05 '13 at 15:43
  • looking at the answers to these questions and the solutions offered here (I will accept one of these - I used the one I posted myself but @juba's solution seems to work as well), I think it is not a duplicate... – speendo Apr 05 '13 at 15:52
  • The question to be addressed is _not_ whether his answer is a duplicate but whether the question is a duplicate. You should do more searching before posting questions. – IRTFM Apr 05 '13 at 16:17
  • I found http://stackoverflow.com/questions/2347410/how-can-i-declare-a-thousand-separator-in-read-csv before posting: the OP wants to remove a *comma*, I wanted to remove a *dot*, I also was not able to translate the answers in this thread to my problem. I didn't find http://stackoverflow.com/questions/10823241/how-can-i-completely-remove-scientific-notation-for-the-entire-r-session/10823641#10823641 before, but this adresses a completely different problem. I did a lot of searching before posting (believe it or not). – speendo Apr 05 '13 at 18:51

2 Answers2

9

You can use this :

sapply(df, function(v) {as.numeric(gsub("\\.","", as.character(v)))})

Which gives :

        A       B    C
[1,] 1100    7800  200
[2,] 2300     500 3100
[3,] 5400 1000000 4500

This will give you a matrix object, but you can wrap it into data.frame() if you wish.

Note that the columns in you original data are not characters but factors.


Edit: Alternatively, instead of wrapping it with data.frame(), you can do this to get the result directly as a data.frame:

# the as.character(.) is just in case it's loaded as a factor
df[] <- lapply(df, function(x) as.numeric(gsub("\\.", "", as.character(x))))
Arun
  • 116,683
  • 26
  • 284
  • 387
juba
  • 47,631
  • 14
  • 113
  • 118
2

I think I just found another solution:

It's necessary to use stringsAsFactors = FALSE.

Like this:

df2 <- as.data.frame(apply(df1, 2, gsub, pattern = "([0-9])\\.([0-9])", replacement= "\\1\\2"), stringsAsFactors = FALSE)

df3 <- as.data.frame(data.matrix(df2))
speendo
  • 13,045
  • 22
  • 71
  • 107
  • I guess this'll just replace 2 dots? – Arun Apr 05 '13 at 13:39
  • why do you think just 2 dots? just tried it with `structure(list(A = c("800.000.000.000", "2.034.312.421", "321.325.123.234" ), B = c("800.000.000.000", "2.034.312.421", "321.325.123.234" ), C = c("800.000.000.000", "2.034.312.421", "321.325.123.234" )), .Names = c("A", "B", "C"), row.names = c(NA, -3L), class = "data.frame")` - all dots were replaced. – speendo Apr 05 '13 at 14:34
  • Yes indeed, sorry, I don't know why I said that. However, this wouldn't work if the number was ".578", right? – Arun Apr 05 '13 at 14:36
  • I think it should also work then. In principle this runs the command `gsub("([0-9])\\.([0-9])", "\\1\\2", x)` for every `x` in the data frame. In other words, the function searches for all patterns `.` and replaces them with ``. This should work with all digits - problems could occur with patterns like `..` but here the dot wouldn't be a thousand's seperator anyway. – speendo Apr 05 '13 at 15:04
  • oh, if the number was just `.578` it wouldn't work, right. But in this case the dot is also not a thousand's seperator. – speendo Apr 05 '13 at 15:05
  • Yes, right. Your answer is very well defined to the OP's question. – Arun Apr 05 '13 at 15:10
  • 1
    maybe because of the personal union ;-) – speendo Apr 05 '13 at 15:53
  • hahaha... just noticed!! my God! I need a nap!!! ahahaha this is too hilarious! – Arun Apr 05 '13 at 16:06