How to use OR to choose between multiple matches for string manipulation in regex

Question

I have a bunch of strings in a column of the data frame. There are multiple typos in the strings so I want to replace them in one line of code. The data frame is shown below:

comment <- c("3.3% 1AT $100000 AND 1.2% BALANCE", "3.3% 1DT $100000 AND 1.2% BALANCE" , 
+ "3.3% 1S $100000 AND 1.2% BALANCE", "3.3% 1SST $100000 AND 1.2% BALANCE", "3.3% 1S T$100000 AND 1.2% BALANCE")
df <- data.frame(comment)

                             comment
1  3.3% 1AT $100000 AND 1.2% BALANCE
2  3.3% 1DT $100000 AND 1.2% BALANCE
3   3.3% 1S $100000 AND 1.2% BALANCE
4 3.3% 1SST $100000 AND 1.2% BALANCE
5  3.3% 1S T$100000 AND 1.2% BALANCE

Basically, what I want to do is to replace 1AT, 1DT, 1S, 1SST with 1ST.

I followed this link Can I use an OR in regex without capturing what's enclosed? and wrote the code below:

gsub("^(\\d\\.\\d%) (?: 1AT|1DT|1S|1SST) (\\$100000) AND (\\d\\.\\d%) BALANCE","\\1 1ST \\2 AND \\3 BALANCE", df$comment)

I understand that this code wouldn't change "1S T$100000" to the right formate, but I expect it to work on other rows. However, this code didn't change the string. Is there any problem with it?

It seems like you're trying to write regex to match the entire string, but you only want to replace a little piece of it. Why not write regex to only match what you want to change? Something like `gsub("(1AT|1DT|1S|1SST)", "1ST", comment)`? — Gregor Thomas, Nov 23 '20 at 19:04
@GregorThomas ,The reason is that my whole dataset has more than 59000 rows and I am afraid that changing only a portion of it in a form of regex will make some unexpected changes in some rows that are accidentally matched with `(1AT|1DT|1S|1SST)", "1ST", comment)` pattern, that's why first I wanted to try it on a more specific pattern — Ross_you, Nov 23 '20 at 19:06
If you do want to debug your regex, I'd strongly recommend regex101.com. [Here's your example coded up](https://regex101.com/r/Dtzkzy/1). I don't think you want a space between `?:` and `1AT` — Gregor Thomas, Nov 23 '20 at 19:08
Fair point - but perhaps you've gone too far in the other extreme. Surely `gsub("% (1AT|1DT|1S|1SST) \$", "% 1ST $", comment)` is still much simpler than your whole thing, but extremely unlikely to be a false positive. — Gregor Thomas, Nov 23 '20 at 19:10
I'd go with @GregorThomas's suggestion first check to see the values that match the regex he gave using a simple `df[grepl("1AT|1DT|1S|1SST", df$comment), ]` if none of them are unwanted use it — Abdessabour Mtk, Nov 23 '20 at 19:11
Thanks, @GregorThomas. You mentioned a valid point. BTW, as you mentioned the problem was with space between `?:` and `1AT`, thanks for mentioning it. — Ross_you, Nov 23 '20 at 19:13
@AbdessabourMtk, thanks, that's a wise note. I didn't know about `grepl` function. — Ross_you, Nov 23 '20 at 19:15
@AbdessabourMtk, it's kind of interesting, I tried using `grepl` and used `grepl("1AT|1DT|1S|1SST", df$comment)` and it returned 5 `TRUE` while I expected 4. Is there any reason for that? — Ross_you, Nov 23 '20 at 19:20
`grepl("(1AT|1DT|1S|1SST)\\s+\$", df$comment)` would do the trick. the 5th line matches the `1S` part of the regex that's why it returns true — Abdessabour Mtk, Nov 23 '20 at 19:22
@Roozbeh_you `grepl("(1AT|1DT|1S|1SST)\\s+\\$", df$comment)` — Abdessabour Mtk, Nov 23 '20 at 19:32

Gregor Thomas · Accepted Answer · 2020-11-23T19:38:30.933

2

I would strongly recommend not overcomplicating your regex. Find a happy medium between complexity and specificity. Perhaps something like this:

gsub("% (1AT|1DT|1S|1SST) \$", "% 1ST $", comment)

Unless you've already done a trimws() or similar, I find whitespace issues very common with regex - either multiple spaces or nonstandard whitespace characters. This would be more robust to that:

gsub("%\\s+(1AT|1DT|1S|1SST)\\s+\\$", "% 1ST $", comment)

edited Nov 23 '20 at 19:38

answered Nov 23 '20 at 19:14

Gregor Thomas

136,190
20
167
294

1

Why I try this I get `Error: '\$' is an unrecognized escape in character string starting ""% (1AT|1DT|1S|1SST) \$"` - any idea why? – iod Nov 23 '20 at 19:24
2

use `\\$` instead of `\$`, in R you need to double escape – Abdessabour Mtk Nov 23 '20 at 19:31
Sorry - forget to add the second \ coming from regex101 – Gregor Thomas Nov 23 '20 at 19:40

iod · Answer 2 · 2020-11-23T19:21:02.583

1

Why do you need that whole complication? Why not just use

gsub("^(\\d\\.\\d%)( 1S T| 1SST | 1AT | 1DT )(\\$100000) AND (\\d\\.\\d%) BALANCE","\\1 1ST \\3 AND \\4 BALANCE", df$comment)
[1] "3.3% 1ST $100000 AND 1.2% BALANCE" "3.3% 1ST $100000 AND 1.2% BALANCE"
[3] "3.3% 1S $100000 AND 1.2% BALANCE"  "3.3% 1ST $100000 AND 1.2% BALANCE"
[5] "3.3% 1ST $100000 AND 1.2% BALANCE"

(Gregor's comment is also relevant - I assumed there's a valid reason for you not to want to replaced other instances of 1AT etc. in other contexts. If there isn't such a reason, use his solution)

edited Nov 23 '20 at 19:21

answered Nov 23 '20 at 19:05

iod

7,412
2
17
36

Thanks, this solution also works. Is there any way to have 1 space gap between `1ST` and `AND` and `BALANCE? currently, we have two spaces – Ross_you Nov 23 '20 at 19:14
Fixed to address the spaces – iod Nov 23 '20 at 19:24

How to use OR to choose between multiple matches for string manipulation in regex

2 Answers2