
How can I (fastest preferable) remove commas from the digit parts of a string without affecting the other commas in the string? In the example below I want to remove the commas from the number portions, but the comma after "dogs" should remain (yes, I know the comma placement in 102,345,5 is wrong; I'm just throwing a corner case out there).

What I have:

x <- "I want to see 102,345,5 dogs, but not too soo; it's 3,242 minutes away"

Desired outcome:

[1] "I want to see 1023455 dogs, but not too soo; it's 3242 minutes away"

Stipulation: this must be done in base R, with no add-on packages.

Thank you in advance.

EDIT: Thank you Dason, Greg and Dirk. All of your responses worked very well. I was playing with something close to Dason's response but had the comma inside the parentheses. Now that I look at it, that doesn't even make sense. I microbenchmarked the responses, as I need speed here (text data):

Unit: microseconds
         expr     min      lq  median      uq     max
1  Dason_0to9  14.461  15.395  15.861  16.328  25.191
2 Dason_digit  21.926  23.791  24.258  24.725  65.777
3        Dirk 127.354 128.287 128.754 129.686 154.410
4      Greg_1  18.193  19.126  19.127  19.594  27.990
5      Greg_2 125.021 125.954 126.421 127.353 185.666

+1 to all of you.
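
For readers who want to repeat a comparison like the one above without add-on packages, here is a rough base-R sketch. The mapping of the table's row names to specific calls is an assumption (the original benchmark code was not posted), and `system.time` over a loop is much coarser than a dedicated benchmarking package, so treat the numbers as indicative only.

```r
x <- "I want to see 102,345,5 dogs, but not too soo; it's 3,242 minutes away"

# Candidate calls. The names mirror the benchmark table, but which call each
# name referred to in the original benchmark is an assumption.
candidates <- list(
  Dason_0to9  = function(s) gsub(",([0-9])", "\\1", s),
  Dason_digit = function(s) gsub(",([[:digit:]])", "\\1", s),
  Dirk        = function(s) gsub("(\\d),(\\d)", "\\1\\2", s, perl = TRUE),
  Greg_1      = function(s) gsub("([0-9]),([0-9])", "\\1\\2", s),
  Greg_2      = function(s) gsub("(?<=\\d),(?=\\d)", "", s, perl = TRUE)
)

# Check that every candidate gives the same answer before timing anything.
results <- vapply(candidates, function(f) f(x), character(1))
stopifnot(length(unique(results)) == 1)

# Rough elapsed time (seconds) for n repeated calls of each candidate.
n <- 2000
timings <- vapply(
  candidates,
  function(f) system.time(for (i in seq_len(n)) f(x))[["elapsed"]],
  numeric(1)
)
print(sort(timings))
```

Timings will vary by machine and R version, so the ordering in the table above may not reproduce exactly.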

Tyler Rinker
  • [What have you tried?](http://whathaveyoutried.com) Hint: R has the ability to do regex replacement. –  Aug 25 '12 at 23:58
  • @GSee I did use the exact example I provided. In a bit I'll post the entire code. I'll throw the `perl = TRUE` in, as you mentioned it for Dirk's answer, but I didn't think to use it in Dason's. – Tyler Rinker Aug 26 '12 at 14:00

3 Answers


You could replace anything that matches the pattern (a comma followed by a digit) with the digit itself.

x <- "I want to see 102,345,5 dogs, but not too soo; it's 3,242 minutes away"
gsub(",([[:digit:]])", "\\1", x)
#[1] "I want to see 1023455 dogs, but not too soo; it's 3242 minutes away"
#or
gsub(",([0-9])", "\\1", x)
#[1] "I want to see 1023455 dogs, but not too soo; it's 3242 minutes away"
Dason
  • It might be more careful to use `([0-9]),([0-9])` for a comma between digits. – Dirk Eddelbuettel Aug 26 '12 at 00:09
  • True. And I considered that at first but got lazy in my solution. My solution should work but you're right in that it would be safest to check for digits on both sides of the comma. – Dason Aug 26 '12 at 00:13
  • Thanks Dason, fastest and easy to understand. I was close to this approach myself. +1 – Tyler Rinker Aug 26 '12 at 02:23
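
The difference raised in the comments above (checking only the character after the comma versus checking both sides) can be seen with a small sketch. The input string `y` is a made-up example, not from the question; it contains a comma that is followed by a digit but not preceded by one.

```r
# Hypothetical input: the comma in "dogs,3" has a digit after it but not before it.
y <- "Bring 1,200 dogs,3 cats"

# One-sided pattern: removes any comma that is followed by a digit.
one_sided <- gsub(",([0-9])", "\\1", y)

# Two-sided pattern: removes a comma only when digits sit on both sides.
two_sided <- gsub("([0-9]),([0-9])", "\\1\\2", y)

print(one_sided)  # "Bring 1200 dogs3 cats"
print(two_sided)  # "Bring 1200 dogs,3 cats"
```

On the question's own example both patterns give the same result; they differ only on inputs like this one.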

Using a Perl-style regexp that matches "digit, comma, digit", we replace each match with just the two digits:

R> x <- "I want to see 102,345,5 dogs, but not too soo; it's 3,242 minutes away"
R> gsub("(\\d),(\\d)", "\\1\\2", x, perl=TRUE)
[1] "I want to see 1023455 dogs, but not too soo; it's 3242 minutes away"
Dirk Eddelbuettel
  • I don't think `perl=TRUE` is required – GSee Aug 26 '12 at 00:11
  • Good to know as that was one of Tyler's requirements :) – Dirk Eddelbuettel Aug 26 '12 at 00:16
  • Thanks Dirk, very easy to use. I saw something similar with regexing names here a while back [(LINK)](http://stackoverflow.com/questions/10468969/reshape-wide-to-long-with-character-suffixes-instead-of-numeric-suffixes) and use it often but didn't think to apply it to numbers. +1 – Tyler Rinker Aug 26 '12 at 02:30
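
As a quick check of the comment above that `perl=TRUE` is not required: R's default regex engine also accepts `\d` as a digit class, so the same call can be run both ways and compared. This is only a sketch on the question's example string.

```r
x <- "I want to see 102,345,5 dogs, but not too soo; it's 3,242 minutes away"

# Same pattern and replacement, with and without the PCRE engine.
with_perl    <- gsub("(\\d),(\\d)", "\\1\\2", x, perl = TRUE)
without_perl <- gsub("(\\d),(\\d)", "\\1\\2", x)

print(identical(with_perl, without_perl))
```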

Here are a couple of options:

> tmp <- "I want to see 102,345,5 dogs, but not too soo; it's 3,242 minutes away"
> gsub('([0-9]),([0-9])','\\1\\2', tmp )
[1] "I want to see 1023455 dogs, but not too soo; it's 3242 minutes away"
> gsub('(?<=\\d),(?=\\d)','',tmp, perl=TRUE)
[1] "I want to see 1023455 dogs, but not too soo; it's 3242 minutes away"

Both target a comma that sits between two digits. [0-9] and \d both match a single digit (in the R string the backslash is doubled, \\d, so that a single \ makes it through to the regular expression).

The first expression captures the digit before the comma and the digit after the comma and uses them in the replacement string, basically pulling them out and putting them back (but not putting the comma back).

The second version uses zero-length matches. The (?<=\\d) says that there needs to be a single digit before the comma in order for it to match, but the digit itself is not part of the match. The (?=\\d) says that there needs to be a digit after the comma in order for it to match, but it is not included in the match either. So it matches a comma, but only if that comma is preceded and followed by a digit. Since only the comma is matched, the replacement string is empty, meaning "delete the comma".
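
One practical consequence of consuming versus not consuming the surrounding digits shows up when single digits are separated by commas. The sketch below uses a made-up string `z` (not from the question). Because the capture-group pattern consumes the digit after the comma, that digit cannot also serve as the "digit before" for the next comma. The lookaround pattern consumes only the comma, so every comma is handled.

```r
# Hypothetical corner case: single digits separated by commas.
z <- "1,2,3,4"

# Capture-group version: "1,2" and "3,4" are matched, but the comma between
# "2" and "3" is skipped because "2" was already consumed by the first match.
captured <- gsub("([0-9]),([0-9])", "\\1\\2", z)

# Lookaround version: only the commas are matched, so none are skipped.
lookaround <- gsub("(?<=\\d),(?=\\d)", "", z, perl = TRUE)

print(captured)    # "12,34"
print(lookaround)  # "1234"
```

For the question's example the two agree, since every comma there has at least two digits between it and the next comma.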

Greg Snow
  • Your first answer is pretty transparent to me; would you mind expanding in your solution a bit on what's happening with the regex, please? – Tyler Rinker Aug 26 '12 at 02:32