R: gsub in a loop to replace names

Question

I have a large dataset (60,000+ rows) that contains names. However, the names are written in inconsistent formats, and to improve data quality I need to recode them into a single format. Instead of copy-pasting the recode command, I would like to do this in a loop, for example. I have a list of all the wrongly written names and a list of the corresponding correctly written names.

So basically, what I want to do is: take name 1 in list1 and replace it with name 1 in list2, then take name 2 in list1 and replace it with name 2 in list2, etc. That doesn't seem like a big deal using gsub, but...

I seem to get close, but the output is still not what I want. Does anyone know why, or have a better solution than what I'm doing now?

EXAMPLE

> dput(list1)
c("Name1", "Name2", "Name3", "Name4", "Name5", "Name6", "Name7", 
"Name8", "Name9", "Name10")
> dput(list2)
c("test1", "test2", "test3", "test4", "test5", "test6", "test7", 
"test8", "test9", "test10")

I've added the print commands to see what is actually happening, and it seems to work:

for (i in 1:length(list1)){
  newlist <- gsub(paste0("\\<",list1[i], "\\>"), list2[i], list1)
  print(i)
  print(newlist[i])
}


[1] 1
[1] "test1"
[1] 2
[1] "test2"
[1] 3
[1] "test3"
[1] 4
[1] "test4"
[1] 5
[1] "test5"
[1] 6
[1] "test6"
[1] 7
[1] "test7"
[1] 8
[1] "test8"
[1] 9
[1] "test9"
[1] 10
[1] "test10"

But then when I ask what newlist would look like:

> newlist
 [1] "Name1"  "Name2"  "Name3" 
 [4] "Name4"  "Name5"  "Name6" 
 [7] "Name7"  "Name8"  "Name9" 
[10] "test10"

I have also tried using lapply and writing my own function, but neither worked the way I wanted :(

This is a common logic issue. `list1` never changes, and `newlist` is overwritten on every iteration, so only the last replacement survives. – Wiktor Stribiżew, Jul 09 '18 at 11:09
Okay... I think I get that. So maybe I should not look for my answer in a loop? Do you know how I can permanently change list1 or newlist? @WiktorStribiżew – Hannie, Jul 09 '18 at 11:11
It will probably be cleaner with [`mgsub`](https://www.rdocumentation.org/packages/textclean/versions/0.9.2/topics/mgsub), but you will need to pass `fixed=FALSE`. See [this answer](https://stackoverflow.com/questions/33411524/using-mgsub-function-with-word-boundaries-for-replacement-values/33415813#33415813). – Wiktor Stribiżew, Jul 09 '18 at 11:15

score 1 · Accepted Answer · answered Jul 09 '18 at 11:16

Define `newlist` outside the loop and change only one element per iteration:

newlist = list1
for (i in 1:length(list1)){
  newlist[i] <- gsub(paste0("\\<",list1[i], "\\>"), list2[i], list1)[i]
}
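
If the goal is to clean a whole column rather than the list of wrong names itself, one possible variant is to feed the result back into `gsub` on each pass, so the replacements accumulate. This is only a sketch: `names_col` and its values are made up for illustration, and the lists are shortened versions of the ones in the question.

```r
list1 <- c("Name1", "Name2", "Name10")
list2 <- c("test1", "test2", "test10")

# Hypothetical data column (not from the question); values are invented.
names_col <- c("Name1", "Name10", "Name2 and Name1", "Other")

cleaned <- names_col
for (i in seq_along(list1)) {
  # Pass the previous result back in so each replacement is kept.
  # \\< and \\> are word boundaries, so "Name1" does not match inside "Name10".
  cleaned <- gsub(paste0("\\<", list1[i], "\\>"), list2[i], cleaned)
}

# cleaned is now c("test1", "test10", "test2 and test1", "Other")
print(cleaned)
```

Because the replacements run one after another, a corrected name that happens to match a later pattern would be replaced again, so the order of the lists can matter.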

K.Hua

Great! Can you please explain why you index at the end of the full expression? So the last `[i]`? – Hannie Jul 09 '18 at 11:55
Yes, because in your code `gsub` takes a vector as input (`list1`), so the output is also a vector of the same length. As you only want to change element i of `newlist`, you have to take the same index from the output of `gsub`. Otherwise you would assign a vector of length `length(list1)` to a single element; R would then issue a warning and use only the first element of that vector – K.Hua Jul 09 '18 at 13:28
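
A small sketch of what the trailing `[i]` changes, using a shortened version of the vectors from the question. Without it, R assigns only the first element of the `gsub` result and issues a warning (silenced here with `suppressWarnings`). The variable names `full`, `without_index` and `with_index` are only for illustration.

```r
list1 <- c("Name1", "Name2", "Name3")
list2 <- c("test1", "test2", "test3")
i <- 2

# gsub() is vectorised: it returns one result per element of list1.
full <- gsub(paste0("\\<", list1[i], "\\>"), list2[i], list1)
# full is c("Name1", "test2", "Name3")

# Without [i]: a length-3 vector is assigned to one slot, so R warns
# and keeps only the first element ("Name1").
without_index <- list1
suppressWarnings(without_index[i] <- full)

# With [i]: only the element that was actually replaced is stored.
with_index <- list1
with_index[i] <- full[i]

print(without_index)  # c("Name1", "Name1", "Name3")
print(with_index)     # c("Name1", "test2", "Name3")
```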

score 1 · Answer 2 · answered Jul 09 '18 at 11:22

You may create regex patterns out of your `list1` with `sapply(list1, function(x) paste0("\\b",x,"\\b"))` and then pass the list of patterns together with the list of replacements into the `qdap::mgsub` function:

list1 <- c("Name1", "Name2", "Name3", "Name4", "Name5", "Name6", "Name7", "Name8", "Name9", "Name10")
list2 <- c("test1", "test2", "test3", "test4", "test5", "test6", "test7", "test8", "test9", "test10")
regList1 <- sapply(list1, function(x) paste0("\\b",x,"\\b"))
qdap::mgsub(regList1, list2, "Name1 should be different. Name10, too.", fixed=FALSE)
## => [1] "test1 should be different. test10, too."

This solution works if the items in the `list1` character vector consist only of alphanumeric or `_` characters. Otherwise, you will also need to escape the values and use a PCRE regex, as described here.
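
As a rough base-R illustration of the escaping step (not the code from the linked answer), the sketch below backslash-escapes regex metacharacters and uses PCRE lookarounds in place of `\\b`, since `\\b` does not behave as a boundary next to non-word characters. The helper `escape_regex` and the sample names and text are assumptions made up for this example.

```r
# Hypothetical helper: put a backslash in front of regex metacharacters.
escape_regex <- function(x) {
  gsub("([.\\\\|()\\[\\]{}^$*+?-])", "\\\\\\1", x, perl = TRUE)
}

# Invented sample data containing characters that are special in a regex.
wrong <- c("J. Smith", "A+B Corp")
right <- c("John Smith", "AB Corporation")
text  <- "Invoice from J. Smith and A+B Corp."

for (i in seq_along(wrong)) {
  # (?<!\\w) and (?!\\w) require that no word character sits directly
  # before or after the match; they need perl = TRUE.
  pattern <- paste0("(?<!\\w)", escape_regex(wrong[i]), "(?!\\w)")
  text <- gsub(pattern, right[i], text, perl = TRUE)
}

# text is now "Invoice from John Smith and AB Corporation."
print(text)
```

This sketch assumes the replacement values contain no backslashes, which `gsub` would otherwise interpret in the replacement string.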

score 0 · Answer 3 · answered Jul 09 '18 at 11:16

You can do this with `mapply`:

mapply(function(x, y){
  gsub(paste0("\\<",x, "\\>"), y, x)
}, list1, list2)

   Name1    Name2    Name3    Name4    Name5    Name6    Name7    Name8    Name9   Name10 
 "test1"  "test2"  "test3"  "test4"  "test5"  "test6"  "test7"  "test8"  "test9" "test10"

Wrap `unname()` around it to get rid of the names.
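
A short sketch of dropping the names, using shortened lists. Besides `unname()`, `mapply` also accepts `USE.NAMES = FALSE`, which should give the same plain vector. The function name `replace_one` is only for illustration.

```r
list1 <- c("Name1", "Name2", "Name10")
list2 <- c("test1", "test2", "test10")

# Same replacement as in the answer, pulled out into a named function.
replace_one <- function(x, y) gsub(paste0("\\<", x, "\\>"), y, x)

named   <- mapply(replace_one, list1, list2)                     # names taken from list1
plain_a <- unname(named)                                         # strip names afterwards
plain_b <- mapply(replace_one, list1, list2, USE.NAMES = FALSE)  # never add names

print(plain_a)  # c("test1", "test2", "test10")
```

Note that this pairs each pattern only with its own element, so it recodes `list1` itself rather than an arbitrary data column.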
