0

I have a dataset and am trying to clean it step by step. One of the challenges I have is that people write their comments in different ways, so sometimes they show a number in a form of $100K while sometimes they show it in a form of $100000. I am wondering how I can change K to 1000, so instead of $43K we have $43000. The sample data is shown below:

structure(list(comment3 = c("3.22%-1ST $100K/1.15% BAL", "3.25% ON 1ST $100000/1.16% ON BAL", 
"3.22% 1ST 100K/1.16 ON BAL", "3.22% 1ST 100K/1.15% ON BAL", 
"3.26% 1ST 100K/1.16% ON BAL", "3.20% 1ST 100K/1.15% ON BAL", 
"3.22% ON 1ST 100K & 1.15% ON BALANCE")), row.names = c(NA, -7L
), class = c("tbl_df", "tbl", "data.frame"))

1 3.22%-1ST $100K/1.15% BAL           
2 3.25% ON 1ST $100000/1.16% ON BAL   
3 3.22% 1ST 100K/1.16 ON BAL          
4 3.22% 1ST 100K/1.15% ON BAL         
5 3.26% 1ST 100K/1.16% ON BAL         
6 3.20% 1ST 100K/1.15% ON BAL         
7 3.22% ON 1ST 100K & 1.15% ON BALANCE

I tried to use the approach explained here Convert from K to thousand (1000) in R However, I wasn't successful. Here is my code:

as.numeric(sub("\\d\\d\\d[K]","Ke3", data$comment3, fixed=TRUE))

I was hoping that by putting K in [], I can somehow separate it and then used the separated K and multiple it by 1000 which didn't work.

The problem is that since I have a combination of text and number, first I have to select the number and then multiply it by 1000 and replace it which I don't know how to do it. I also have a non-efficient method which is working right now:

bb <- str_match(df_com$comment3, pattern = "\\d\\d\\dK")
table(bb)

by doing this I found that I only have cases like 100K 350K and 110K, so then I replaced these numbers with 100000, 350000 and 110000, but this method is not efficient and kind of stupid! Any comment on how to fix this?

Ross_you
  • 881
  • 5
  • 22

2 Answers2

1

This is a quick and dirty way which matches one or more digits followed by K and appends 000 to it:

data %>% 
  mutate(comment3 = str_replace(comment3 , "(\\d+)K", "\\1000"))

Where your data are placed in `data'.

Using \1 (or \\1 when escaped) to include the contents of the matched group (here, \\d+) seems to be the piece that you were missing in your attempt.

Results from your sample data:

# A tibble: 7 x 1
  comment3                              
  <chr>                                 
1 3.22%-1ST $100000/1.15% BAL           
2 3.25% ON 1ST $100000/1.16% ON BAL     
3 3.22% 1ST 100000/1.16 ON BAL          
4 3.22% 1ST 100000/1.15% ON BAL         
5 3.26% 1ST 100000/1.16% ON BAL         
6 3.20% 1ST 100000/1.15% ON BAL         
7 3.22% ON 1ST 100000 & 1.15% ON BALANCE
semaphorism
  • 836
  • 3
  • 13
  • Thanks, I guess it's working. Can you explain more in detail how this code works? So, you are replacing `(\\d+)` with `1000`? I understand that `\\d+` means, one or more digit, but I am not sure why you used `()` and not `[]`. And why `\\1000` and not `1000` for the replacement? Sorry if it's not clear, it's been just two days since I started working with regular expression – Ross_you Nov 06 '20 at 00:25
  • 1
    `\d` is the metacharacter for a digit (0-9), which we escape as `\\d`. The `+` means to match one or more of the preceding, and the parentheses define the matched group, which is picked up by the `\\1` in the replacement. This will take any series of digits followed by K and replace the 'K' with 000, which is the equivalent of multiplying by 1000. – semaphorism Nov 06 '20 at 00:29
  • Oh, I see, so basically, you find all the matches in the form of `(\\d+)K` and by putting `\d+` in `()` you only select that part as the matched group and then by using `\\1` you pass the match group (here it's `\\d`) to it. correct? – Ross_you Nov 06 '20 at 00:34
  • Exactly. You got it! – semaphorism Nov 06 '20 at 00:35
1

You can also use lookbehind regex in base R :

sub('(?<=\\d)K', '000', df$comment3, perl = TRUE)

#[1] "3.22%-1ST $100000/1.15% BAL"        "3.25% ON 1ST $100000/1.16% ON BAL"
#[3] "3.22% 1ST 100000/1.16 ON BAL"       "3.22% 1ST 100000/1.15% ON BAL"     
#[5] "3.26% 1ST 100000/1.16% ON BAL"      "3.20% 1ST 100000/1.15% ON BAL"       
#[7] "3.22% ON 1ST 100000 & 1.15% ON BALANCE"
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213