0

I have a bunch of strings in a column of the data frame. There are multiple typos in the strings so I want to replace them in one line of code. The data frame is shown below:

comment <- c("3.3% 1AT $100000 AND 1.2% BALANCE", "3.3% 1DT $100000 AND 1.2% BALANCE" , 
+ "3.3% 1S $100000 AND 1.2% BALANCE", "3.3% 1SST $100000 AND 1.2% BALANCE", "3.3% 1S T$100000 AND 1.2% BALANCE")
df <- data.frame(comment)

                             comment
1  3.3% 1AT $100000 AND 1.2% BALANCE
2  3.3% 1DT $100000 AND 1.2% BALANCE
3   3.3% 1S $100000 AND 1.2% BALANCE
4 3.3% 1SST $100000 AND 1.2% BALANCE
5  3.3% 1S T$100000 AND 1.2% BALANCE

Basically, what I want to do is to replace 1AT, 1DT, 1S, 1SST with 1ST.

I followed this link Can I use an OR in regex without capturing what's enclosed? and wrote the code below:

gsub("^(\\d\\.\\d%) (?: 1AT|1DT|1S|1SST) (\\$100000) AND (\\d\\.\\d%) BALANCE","\\1 1ST \\2 AND \\3 BALANCE", df$comment)

I understand that this code wouldn't change "1S T$100000" to the right formate, but I expect it to work on other rows. However, this code didn't change the string. Is there any problem with it?

Ross_you
  • 881
  • 5
  • 22
  • 2
    It seems like you're trying to write regex to match the entire string, but you only want to replace a little piece of it. Why not write regex to only match what you want to change? Something like `gsub("(1AT|1DT|1S|1SST)", "1ST", comment)`? – Gregor Thomas Nov 23 '20 at 19:04
  • @GregorThomas ,The reason is that my whole dataset has more than 59000 rows and I am afraid that changing only a portion of it in a form of regex will make some unexpected changes in some rows that are accidentally matched with `(1AT|1DT|1S|1SST)", "1ST", comment)` pattern, that's why first I wanted to try it on a more specific pattern – Ross_you Nov 23 '20 at 19:06
  • 1
    If you do want to debug your regex, I'd strongly recommend regex101.com. [Here's your example coded up](https://regex101.com/r/Dtzkzy/1). I don't think you want a space between `?:` and `1AT` – Gregor Thomas Nov 23 '20 at 19:08
  • 2
    Fair point - but perhaps you've gone too far in the other extreme. Surely `gsub("% (1AT|1DT|1S|1SST) \$", "% 1ST $", comment)` is still much simpler than your whole thing, but extremely unlikely to be a false positive. – Gregor Thomas Nov 23 '20 at 19:10
  • 2
    I'd go with @GregorThomas's suggestion first check to see the values that match the regex he gave using a simple `df[grepl("1AT|1DT|1S|1SST", df$comment), ]` if none of them are unwanted use it – Abdessabour Mtk Nov 23 '20 at 19:11
  • Thanks, @GregorThomas. You mentioned a valid point. BTW, as you mentioned the problem was with space between `?:` and `1AT`, thanks for mentioning it. – Ross_you Nov 23 '20 at 19:13
  • @AbdessabourMtk, thanks, that's a wise note. I didn't know about `grepl` function. – Ross_you Nov 23 '20 at 19:15
  • @AbdessabourMtk, it's kind of interesting, I tried using `grepl` and used `grepl("1AT|1DT|1S|1SST", df$comment)` and it returned 5 `TRUE` while I expected 4. Is there any reason for that? – Ross_you Nov 23 '20 at 19:20
  • 1
    `grepl("(1AT|1DT|1S|1SST)\\s+\$", df$comment)` would do the trick. the 5th line matches the `1S` part of the regex that's why it returns true – Abdessabour Mtk Nov 23 '20 at 19:22
  • 1
    @Roozbeh_you `grepl("(1AT|1DT|1S|1SST)\\s+\\$", df$comment)` – Abdessabour Mtk Nov 23 '20 at 19:32

2 Answers2

2

I would strongly recommend not overcomplicating your regex. Find a happy medium between complexity and specificity. Perhaps something like this:

gsub("% (1AT|1DT|1S|1SST) \$", "% 1ST $", comment)

Unless you've already done a trimws() or similar, I find whitespace issues very common with regex - either multiple spaces or nonstandard whitespace characters. This would be more robust to that:

gsub("%\\s+(1AT|1DT|1S|1SST)\\s+\\$", "% 1ST $", comment)
Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
1

Why do you need that whole complication? Why not just use

gsub("^(\\d\\.\\d%)( 1S T| 1SST | 1AT | 1DT )(\\$100000) AND (\\d\\.\\d%) BALANCE","\\1 1ST \\3 AND \\4 BALANCE", df$comment)
[1] "3.3% 1ST $100000 AND 1.2% BALANCE" "3.3% 1ST $100000 AND 1.2% BALANCE"
[3] "3.3% 1S $100000 AND 1.2% BALANCE"  "3.3% 1ST $100000 AND 1.2% BALANCE"
[5] "3.3% 1ST $100000 AND 1.2% BALANCE"

(Gregor's comment is also relevant - I assumed there's a valid reason for you not to want to replaced other instances of 1AT etc. in other contexts. If there isn't such a reason, use his solution)

iod
  • 7,412
  • 2
  • 17
  • 36