Why does using gsub() in R to delete dollar sign actually change the data?

Question

There are signs of "$" and "," in the Price variable in my dataframe as the head(data) and str(data) show.

I tried to delete the signs using gsub():

data_new <- gsub("[$,]", "", data)

I also tried:

data_new <- gsub("[\\$,]", "", data)

But when I checked the data_new using head(data), it turned: image "c(59 32 60 56 52 95 4 47 32 293 353 23 25 119 280 330 172 65 73 370 22 32 383 65 14 26 172 106 43 59 297 32 315 50 315 363 25 254 353 230 383 23 76 209 17 378 37 105 365 353 17 95 69 105 59 353 52 254 94 172 331 383 330 95 353 172 341 242 280 59 25 353 131 156 49 383..." Thanks to your ideas, what I am doing now is:

# delete "$" and "," sign
data_price <- gsub("[\\$,]", "", data$price)

# select other variables in the data and combine the price vector to create 
  a new data frame.
df <- data.frame(price = data_price, room_type =  data$room_type, 
        accommodates =  data$accommodates, bedrooms =  data$bedrooms, 
        bathrooms =  data$bathrooms, beds =  data$beds, 
        review_scores_rating = data$review_scores_rating)

Though it works, I have a few questions:

Why did the previous way alter the data is changed? Is it common in data cleaning and preparation?
Any other way works better to delete $ and , in the Price variable but keep all other information same as before? "Better", I mean, more concise code.

Here are the first 12 observations head(data, 12):

    price   room_type   accommodates bedrooms  bathrooms beds
    <fctr>  <fctr>        <int> <int> <dbl> <int>
1   $150.00 Entire home/apt   6   2     2   4   
2   $119.00 Entire home/apt   4   0     1   2   
3   $151.00 Entire home/apt   4   2     2   2   
4   $146.00 Entire home/apt   2   1     1   1   
5   $140.00 Entire home/apt   4   1     1   2   
6   $199.00 Entire home/apt   4   2     1   3   
7   $1,200.00   Entire home/apt8  3     1   4   
9   $135.00 Entire home/apt   8   4     3   4   
11  $119.00 Entire home/apt   2   1     1   1   
12  $55.00  Private room      2   1     1   1

Here is the structure:

'data.frame':   5052 obs. of  7 variables:
 $ price               : num  150 119 151 146 140 199 1200 135 119 55 ...
 $ room_type           : Factor w/ 3 levels "Entire home/apt",..: 1 1 1 1 1 1 1 1 1 2 ...
 $ accommodates        : int  6 4 4 2 4 4 8 8 2 2 ...
 $ bedrooms            : int  2 0 2 1 1 2 3 4 1 1 ...
 $ bathrooms           : num  2 1 2 1 1 1 1 3 1 1 ...
 $ beds                : int  4 2 2 1 2 3 4 4 1 1 ...
 $ review_scores_rating: int  93 96 84 98 95 93 80 100 93 91 ...

Thank you.

We cannot reproduce your result without exactly your data. Please type `dput(head(data, 20))` and paste the result into your question. — G5W, Sep 01 '19 at 00:58
Hi G5W, thank you for your suggestions. The output is a lot when I use dput(head(data, 20)). I'm afraid it is too much to be pasted here. I attach the first 12 observations and the str(data). Do you think it may help? Please let me know. Thanks again. — vcai01, Sep 01 '19 at 03:57
`data` is the dataframe name, but you want to apply this to only one column hence you need to pass `data$price`. — Ronak Shah, Sep 01 '19 at 04:46
Thank you, Ronak. I agree that it causes the problem. What if there are two columns in the data, e.g. Price variable and Sales variable, both contain dollar sign, can I use `gsub("[\\$]", "", data)`? Or do I have to specify data as `data$Price` in one code and `data$Sales` in the other? — vcai01, Sep 01 '19 at 16:31

Shree · Answer 1 · 2019-08-31T23:00:44.010

2

$ is a special character in regex indicating end of string/line. To match literal $ you need to escape it using \\.

x <- "$10,000"

gsub("[\\$,]", "", x)

[1] "10000"

edited Aug 31 '19 at 23:00

answered Aug 31 '19 at 22:55

Shree

10,835
1
14
36

Thank you for your suggestion, Shree. I tried but the output of head(data) still shows the wrong numbers, the same screen shot embedded above. – vcai01 Sep 01 '19 at 00:39
@vcai01 See other answer which clearly shows how to apply above solution to your problem. – Shree Sep 01 '19 at 02:32

score 1 · Accepted Answer · answered Sep 01 '19 at 00:55

I was able to delete the $ and , characters from all of the price values by calling gsub('[\\$,]', '', df$price) on the data frame created as follows:

df <- data.frame(
  price = c('$150.00', '$119.00', '$151.00', '$146.00', '$140.00', '$199.00'),
  room_type = rep('Entire home/apt', 6),
  accommodates = c(6, 4, 4, 2, 4, 4),
  bedrooms = c(2, 0, 2, 1, 1, 2),
  bathrooms = c(2, 1, 2, 1, 1, 1),
  beds = c(4, 2, 2, 1, 2, 3)
)

You can assign the resulting character vector to the price column of your data frame by calling df$price <- gsub('[\\$,]', '', df$price).

Then, if I call head(df) and str(df), I get...

> head(df)
   price       room_type accommodates bedrooms bathrooms beds
1 150.00 Entire home/apt            6        2         2    4
2 119.00 Entire home/apt            4        0         1    2
3 151.00 Entire home/apt            4        2         2    2
4 146.00 Entire home/apt            2        1         1    1
5 140.00 Entire home/apt            4        1         1    2
6 199.00 Entire home/apt            4        2         1    3

> str(df)
'data.frame':   6 obs. of  6 variables:
 $ price       : chr  "150.00" "119.00" "151.00" "146.00" ...
 $ room_type   : Factor w/ 1 level "Entire home/apt": 1 1 1 1 1 1
 $ accommodates: num  6 4 4 2 4 4
 $ bedrooms    : num  2 0 2 1 1 2
 $ bathrooms   : num  2 1 2 1 1 1
 $ beds        : num  4 2 2 1 2 3

If you want the price values to be numerical, you can wrap the gsub() call with as.numeric().

Why does using gsub() in R to delete dollar sign actually change the data?

2 Answers2