String replacement with multiple options in R

Question

My problem: I am called on to compare pesticide lists that can be anywhere from 100 to 500 compounds long. I have no problem importing and spreading them, but if the names do not match, the columns do not align. Naming is a creative sport in the pesticide world: Endosulfan or Endosulphan, op-DDT or DDT (o,p).

My view was that if I created a dictionary in Excel where the first column held a preferred name (pref) and the columns to the right held alternatives (up to five), I could run through the pesticide list to standardise the naming before spreading it, and then get alignment.

I tried creating a string of the alternatives, omitting the empty fields and then using sub to do the replacement.

For example, I set my preferred name as

pref <- "HCH-gamma (Lindane)"

and a string of alternatives as

check_list <- "BHC-gamma (Lindane)|BHC - gamma (Lindane)|Lindane"

and then ran a loop through a df of names with

Combined$Name[i] <- sub(check_list, pref, Combined$Name[i])

What started out as

name <- c("HCH-gamma (Lindane)","BHC-gamma (Lindane)","BHC - gamma (lindane)","Lindane")

should end up as

name <- c("HCH-gamma (Lindane)","HCH-gamma (Lindane)","HCH-gamma (Lindane)","HCH-gamma (Lindane)")

But it didn't. The results were weird, such as

"BHC - gamma (HCH-gamma (Lindane))"

Clearly I do not have the grammar correct but it is the first time I have tried string manipulation like this and cannot fathom out what I am doing wrong. Any guidance would be appreciated. Or is there a better way to do it?

You should escape the `(` and `)` - `check_list <- "BHC-gamma \\(Lindane\\)|BHC - gamma \\(Lindane\\)|Lindane"`. Also, since you have `Lindane` as an alternative, it will be replaced with the `pref`. So, the result is expected. Maybe you need to remove `|Lindane`? Or what is the rule here? Please also provide the input data (`Combined$Name[i]`) to test against together with the expected output. — Wiktor Stribiżew, Mar 23 '17 at 10:40
No idea what you mean, please add necessary details to the question. — Wiktor Stribiżew, Mar 23 '17 at 11:18
@Wiktor Sorry, I haven't forgotten you but timezones (Australia) and a need to take my wife to hospital intervened. Once I return from collecting her tonight I will provide more details. I do appreciate your patience and helpfulness. — Lee_Kennedy, Mar 24 '17 at 04:45
It is ok, just the comment about the four name list is rather unclear. Hope your wife is fine now. — Wiktor Stribiżew, Mar 24 '17 at 07:33
@Wiktor She's sleeping. A bit of a heart scare. The comment left early (note to self:ENTER key ≠ CR) I didn't have time to retrieve or change. I have enlarged the question to better explain what and why I am attempting to do. — Lee_Kennedy, Mar 24 '17 at 20:28
Ok, you have lists of what to search for and what to replace with, 1 to 1, right? You read them from the external source and not hard-code into your R program, right? — Wiktor Stribiżew, Mar 24 '17 at 21:13
That's right, otherwise it becomes too labor intensive. Ideally (perhaps too simply) I imagine it as a dictionary that can be drawn on to standardise the naming. — Lee_Kennedy, Mar 24 '17 at 21:37
Have you checked the approaches in [Replace multiple arguments with gsub](http://stackoverflow.com/questions/15253954/replace-multiple-arguments-with-gsub)? — Wiktor Stribiżew, Mar 24 '17 at 21:56
Used a variation of Andrew Mackenzie's answer at http://stackoverflow.com/questions/15253954/replace-multiple-arguments-with-gsub. Thank you for your help. Lee. — Lee_Kennedy, Mar 25 '17 at 05:46
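
The escaping point raised in the first comment can be checked with a small sketch. It reuses the names from the question; the variables `unescaped`, `escaped`, `broken` and `repaired` are made up for illustration, and anchoring the pattern with `^` and `$` is an added assumption (that whole names should match), not something stated in the thread.

```r
pref <- "HCH-gamma (Lindane)"
name <- c("HCH-gamma (Lindane)", "BHC-gamma (Lindane)",
          "BHC - gamma (lindane)", "Lindane")

# Unescaped: "(" and ")" act as grouping operators, so the first two
# alternatives look for "BHC-gamma Lindane" / "BHC - gamma Lindane"
# (no brackets) and never match; only the bare "Lindane" alternative
# matches, and it is replaced in place inside the brackets.
unescaped <- "BHC-gamma (Lindane)|BHC - gamma (Lindane)|Lindane"
broken <- sub(unescaped, pref, "BHC - gamma (Lindane)")
print(broken)

# Escaped brackets, anchored so that only whole names are replaced.
# ignore.case = TRUE is an assumption to cope with "(lindane)".
escaped <- "^(BHC-gamma \\(Lindane\\)|BHC - gamma \\(Lindane\\)|Lindane)$"
repaired <- sub(escaped, pref, name, ignore.case = TRUE)
print(repaired)
```

With the unescaped pattern, `broken` reproduces the odd result quoted in the question; with the escaped and anchored pattern, every element of `repaired` equals `pref`.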

score 0 · Answer 1 · answered Mar 23 '17 at 11:43

Kudos to apom for this.

ifelse(grepl(searchTerm, myVector), newTerm, myVector)
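
The one-liner above handles a single search term. As a rough sketch of the dictionary idea from the question, the same `ifelse` step can be driven by a lookup table. Note that this swaps regex matching for exact, case-insensitive matching of whole names, which avoids escaping altogether. The data frame `dict`, its column names, and the helper `standardise` are all hypothetical; in practice the table would be read from the Excel sheet rather than typed in.

```r
# Hypothetical dictionary: preferred name first, alternatives to the right.
dict <- data.frame(
  pref = c("HCH-gamma (Lindane)", "Endosulfan"),
  alt1 = c("BHC-gamma (Lindane)", "Endosulphan"),
  alt2 = c("BHC - gamma (Lindane)", NA),
  alt3 = c("Lindane", NA),
  stringsAsFactors = FALSE
)

# Flatten to a named vector: names are alternatives (lower-cased),
# values are the preferred names. Empty cells are dropped.
alt_cols <- setdiff(names(dict), "pref")
alts  <- unlist(dict[alt_cols], use.names = FALSE)   # column by column
prefs <- rep(dict$pref, times = length(alt_cols))    # same order as alts
keep  <- !is.na(alts) & alts != ""
lookup <- setNames(prefs[keep], tolower(alts[keep]))

# Replace a name only when the whole (trimmed, lower-cased) name is a
# known alternative; anything not in the dictionary is left untouched.
standardise <- function(x, lookup) {
  hit <- lookup[tolower(trimws(x))]
  unname(ifelse(is.na(hit), x, hit))
}

name <- c("HCH-gamma (Lindane)", "BHC-gamma (Lindane)",
          "BHC - gamma (lindane)", "Lindane")
result <- standardise(name, lookup)
print(result)
```

Because the match is on whole names, a preferred name that happens to contain an alternative (here `Lindane` inside `HCH-gamma (Lindane)`) is not rewritten a second time.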

Clarius
