57

I have a large vector of strings of the form:

Input = c("1,223", "12,232", "23,0")

etc. That's to say, decimals separated by commas, instead of periods. I want to convert this vector into a numeric vector. Unfortunately, as.numeric(Input) just outputs NA.

My first instinct would be to go to strsplit, but it seems to me that this will likely be very slow. Does anyone have any idea of a faster option?

There's an existing question that suggests read.csv2, but the strings in question are not directly read in that way.

sebastian-c
Fhnuzoag
  • 1
    Why not just replace the commas with decimal points (via `gsub`, if memory serves), and then apply `as.numeric`? –  Mar 05 '13 at 23:55
  • 3
    Not a duplicate. Thousands separators and decimal separators are very different. `as.numeric` is also for data conversion, not data reading. – Fhnuzoag Nov 11 '16 at 10:43

7 Answers

76
as.numeric(sub(",", ".", Input, fixed = TRUE))

should work.
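
For the sample vector from the question, a minimal check of this one-liner might look like the following. This is only a sketch; `converted` is an illustrative name that does not appear in the answer. With `fixed = TRUE`, `sub` treats the comma as a literal string rather than a regular expression, and it replaces only the first comma in each element.

```r
Input <- c("1,223", "12,232", "23,0")

# Replace the first comma in each string with a period, then convert.
converted <- as.numeric(sub(",", ".", Input, fixed = TRUE))
converted
```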

adibender
  • This works fine, but it outputs `Warning message: NAs introduced by coercion`. How can I avoid it? – Dan Chaltiel Sep 15 '17 at 12:17
  • the warning or the `NA`s? – adibender Sep 15 '17 at 13:40
  • @HarlanNelson the OP didn't ask about the case with two commas – adibender Nov 20 '18 at 15:13
  • 8
    @adibender That might be true, but the purpose of SO is to help the general community, not a single individual. A generalizable solution is best. The purpose of answering the OP's question is to keep focus, not to make the answers apply only to narrow cases. – Harlan Nelson Nov 21 '18 at 12:54
  • 1
    @HarlanNelson have to disagree slightly. It's perfectly OK to answer OP's specific question. Also, although a narrow case, I think it's the most common in practice. It's up to other users and OP to decide whether to accept/upvote and more general answers will usually get more upvotes (if otherwise equivalent). If you have a more general solution, I'd invite you to post your answer. If not, I invite you to post a question that asks for a solution to your, more general problem, so others with the same question may find an answer. Both to the benefit of SO. – adibender Nov 21 '18 at 16:22
  • @adibender the OP didn't ask a specific question about cases with only single commas. The question was posed in a general way by stating that the data is 'of the form'. The example given to explain that form contained only numbers with single commas, but it is far-fetched to assume that there can't be two or more commas. – Sextus Empiricus Oct 21 '22 at 16:29
21

The readr package has a function to parse numbers from strings. You can set many options via the locale argument.

For comma as decimal separator you can write:

readr::parse_number(Input, locale = readr::locale(decimal_mark = ","))
tspano
13
scan(text=Input, dec=",")
## [1]  1.223 12.232 23.000

But it depends on how long your vector is. I used rep(Input, 1e6) to make a long vector and my machine just hangs. 1e4 is fine, though. @adibender's solution is much faster, as a benchmark on rep(Input, 1e4) shows:

Unit: milliseconds
         expr        min         lq     median         uq        max neval
  adibender()   6.777888   6.998243   7.119136   7.198374   8.149826   100
 sebastianc() 504.987879 507.464611 508.757161 510.732661 517.422254   100
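
The code behind this benchmark is not shown in the answer. A rough way to repeat the comparison with base R only might look like the sketch below. The bodies of `adibender()` and `sebastianc()` are assumptions (simple wrappers around the two approaches), and `long_input` is an illustrative name.

```r
Input <- c("1,223", "12,232", "23,0")
long_input <- rep(Input, 1e4)

# Assumed wrappers; the original benchmark code is not given in the answer.
adibender <- function() as.numeric(sub(",", ".", long_input, fixed = TRUE))
sebastianc <- function() scan(text = long_input, dec = ",", quiet = TRUE)

# Both approaches should return the same numbers.
stopifnot(isTRUE(all.equal(adibender(), sebastianc())))

# Elapsed time for a single run of each; timings vary by machine and R version.
system.time(adibender())
system.time(sebastianc())
```

A single `system.time` call is much noisier than a repeated benchmark, so this only gives an order-of-magnitude impression.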
sebastian-c
10

Also, if you are reading in the raw data, read.table and all its associated functions have a dec argument, e.g.:

read.table("file.txt", dec=",")

When all else fails, gsub and sub are your friends.
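
To try the `dec` argument without a file on disk, `read.table` also accepts a `text` argument. The sketch below uses made-up semicolon-separated data; the column names and values are purely illustrative.

```r
# Hypothetical semicolon-separated data with comma decimal marks.
raw <- "x;y
1,223;12,232
23,0;4,5"

# dec = "," tells read.table to treat the comma as the decimal mark.
df <- read.table(text = raw, header = TRUE, sep = ";", dec = ",")
str(df)
```

Both columns should come back as numeric rather than character.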

m0nhawk
Ricardo Saporta
  • 1
    I would not have expected the first one to work and it does not seem to with the test case offered: `as.numeric(format(Input, decimal.mark=","))` [1] NA NA NA – IRTFM Mar 06 '13 at 00:28
  • @DWin - strange that there are functions to add separator marks, e.g.: `prettyNum(c(1223,12232,23),big.mark=",",preserve.width="none")` but nothing to go back the other way? – thelatemail Mar 06 '13 at 00:59
  • Except adibender's approach or defining an `As`-method. And `decimal.mark` _would_ be different than `big.mark`. – IRTFM Mar 06 '13 at 01:04
  • @Dwin - Fair enough. I maybe should have been more immediately relevant in my choice of example. – thelatemail Mar 06 '13 at 01:09
  • I'm trying to think of design criteria that would auto-detect on the basis of the maximum numbers of "," and "." in the vector. Maybe a `regexec` approach? – IRTFM Mar 06 '13 at 01:30
  • @DWin, you are correct, thanks. I've gone ahead and edited out the wrong suggestion – Ricardo Saporta Mar 06 '13 at 02:37
4

Building on @adibender's solution:

input = '23,67'
as.numeric(gsub(
                # ONLY for strings containing numerics, comma, numerics
                "^([0-9]+),([0-9]+)$", 
                # Substitute by the first part, dot, second part
                "\\1.\\2", 
                input
                ))

I guess that is a safer match...
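
To see what "safer" means here, the sketch below applies the same pattern to a vector that also contains strings it should not touch. The vector `mixed` and the name `rewritten` are illustrative assumptions.

```r
mixed <- c("23,67", "1,223,765", "abc")

# Only strings of the form digits-comma-digits are rewritten.
rewritten <- gsub("^([0-9]+),([0-9]+)$", "\\1.\\2", mixed)
rewritten

# Strings that do not match are left unchanged, so as.numeric() returns NA
# for them (the coercion warning is suppressed here).
suppressWarnings(as.numeric(rewritten))
```

Anything that is not exactly one comma between two runs of digits ends up as `NA` instead of being silently converted to a wrong number.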

Deena
3

As stated by , it's way easier to do this while importing a file. The recently released readr package has a very useful feature, locale, well explained here, that allows the user to import numbers with a comma decimal mark by passing locale = locale(decimal_mark = ",") as an argument.

Emiliano
3

The answer by adibender does not work when there are multiple commas.

In that case the suggestion from use554546 and answer from Deena can be used.

Input = c("1,223,765", "122,325,000", "23,054")
as.numeric(gsub(",", "", Input))

output:

[1] 1223765 122325000 23054

The function gsub replaces all occurrences. The function sub replaces only the first.
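
A small side-by-side comparison makes the difference concrete. This is a sketch; the names `x`, `first_only` and `all_commas` are illustrative.

```r
x <- "1,223,765"

first_only <- sub(",", "", x, fixed = TRUE)    # removes only the first comma
all_commas <- gsub(",", "", x, fixed = TRUE)   # removes every comma

first_only
all_commas

# Only the gsub() result can be converted to a number.
as.numeric(all_commas)
```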

wibeasley