I have a large dataset (60,000+ rows) that contains names. However, the names are written in several different formats, and to improve data quality I need to recode them into a single format. Instead of copy-pasting the recode command, I would like to do this in a loop, for example. I have a list of all the wrongly written names and a list of the corresponding correctly written names.

So basically, what I want to do is: take name 1 in list1 and replace it with name 1 in list2, then take name 2 in list1 and replace it with name 2 in list2, and so on. That does not seem like a big deal with gsub, but...

I get close, yet the output is still not what I want. Does anyone know why, or have a better solution than what I'm doing now?

EXAMPLE

> dput(list1)
c("Name1", "Name2", "Name3", "Name4", "Name5", "Name6", "Name7", 
"Name8", "Name9", "Name10")
> dput(list2)
c("test1", "test2", "test3", "test4", "test5", "test6", "test7", 
"test8", "test9", "test10")

I've added the print commands to see what is actually happening, and it seems to work:

for (i in 1:length(list1)){
  newlist <- gsub(paste0("\\<",list1[i], "\\>"), list2[i], list1)
  print(i)
  print(newlist[i])
}


[1] 1
[1] "test1"
[1] 2
[1] "test2"
[1] 3
[1] "test3"
[1] 4
[1] "test4"
[1] 5
[1] "test5"
[1] 6
[1] "test6"
[1] 7
[1] "test7"
[1] 8
[1] "test8"
[1] 9
[1] "test9"
[1] 10
[1] "test10"

But then when I ask what newlist would look like:

> newlist
 [1] "Name1"  "Name2"  "Name3" 
 [4] "Name4"  "Name5"  "Name6" 
 [7] "Name7"  "Name8"  "Name9" 
[10] "test10"

I have also tried using lapply and writing my own function, but neither worked out the way I wanted :(

Hannie
  • This is a common logical issue. `list1` is not changing, so each iteration starts from the original vector and `newlist` holds only the result of the last iteration. – Wiktor Stribiżew Jul 09 '18 at 11:09
  • Okay... I think I get that. So maybe I should not look for my answer in a loop? Do you know how I can permanently change list1 or newlist? @WiktorStribiżew – Hannie Jul 09 '18 at 11:11
  • It will probably be cleaner with [`mgsub`](https://www.rdocumentation.org/packages/textclean/versions/0.9.2/topics/mgsub), but you will need to pass `fixed=FALSE`. See [this answer](https://stackoverflow.com/questions/33411524/using-mgsub-function-with-word-boundaries-for-replacement-values/33415813#33415813). – Wiktor Stribiżew Jul 09 '18 at 11:15
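
The point made in the comments above can be sketched as follows: if each iteration feeds the previous result back into gsub, the replacements accumulate instead of being overwritten. This is only a minimal sketch; the `text` vector is a hypothetical stand-in for the real data column, and the shortened `list1`/`list2` are chosen for illustration.

```r
list1 <- c("Name1", "Name2", "Name10")
list2 <- c("test1", "test2", "test10")

# Hypothetical text column in which the names occur.
text <- c("Name1 met Name10", "Name2 only")

result <- text
for (i in seq_along(list1)) {
  pattern <- paste0("\\<", list1[i], "\\>")
  # Feed the previous result back in so earlier replacements are kept.
  result <- gsub(pattern, list2[i], result)
}
result
```

Because the word-boundary markers are kept, `Name1` is not replaced inside `Name10`. One caveat of this approach: if a replacement value itself matches a later pattern, it will be replaced again.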

3 Answers

Define newlist outside the loop and change only one element at a time inside it:

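# Start from a copy so that elements not touched keep their original value
newlist <- list1
for (i in seq_along(list1)) {
  # gsub() returns a vector as long as list1; keep only element i of it
  newlist[i] <- gsub(paste0("\\<", list1[i], "\\>"), list2[i], list1)[i]
}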
K.Hua
  • Great! Can you please explain why you index at the end of the full expression? So the last `[i]`? – Hannie Jul 09 '18 at 11:55
  • Yes, because in your code gsub takes a vector as input (list1), so the output is also a vector of the same length. As you only want to change element i of newlist, you have to take the same index from the output of gsub. Otherwise you would assign a vector of length(list1) to a single element; R would then use only its first element and issue a warning – K.Hua Jul 09 '18 at 13:28
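
To illustrate the explanation above, here is a small sketch (with a shortened `list1`, chosen for illustration) showing that gsub returns one result per input element, and that passing only element i makes the trailing `[i]` unnecessary:

```r
list1 <- c("Name1", "Name2", "Name3")

# gsub() is vectorised over its third argument: one result per input element.
out <- gsub("\\<Name2\\>", "test2", list1)
length(out) == length(list1)  # TRUE
out[2]                        # "test2", the only element this pattern changed

# Passing just the i-th element gives a length-one result,
# so no trailing [i] is needed.
i <- 2
single <- gsub("\\<Name2\\>", "test2", list1[i])
single
```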

You can build regex patterns from list1 with sapply(list1, function(x) paste0("\\b",x,"\\b")) and then pass the patterns together with the replacements to the qdap::mgsub function:

list1 <- c("Name1", "Name2", "Name3", "Name4", "Name5", "Name6", "Name7", "Name8", "Name9", "Name10")
list2 <- c("test1", "test2", "test3", "test4", "test5", "test6", "test7", "test8", "test9", "test10")
regList1 <- sapply(list1, function(x) paste0("\\b",x,"\\b"))
qdap::mgsub(regList1, list2, "Name1 should be different. Name10, too.", fixed=FALSE)
## => [1] "test1 should be different. test10, too."

This solution works only if every item in the list1 character vector consists solely of alphanumeric or _ characters. Otherwise, you also need to escape the values and use a PCRE regex, the way it is described here.
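
As a rough sketch of the escaping step mentioned above, using base gsub rather than mgsub: `escape_regex` is a hypothetical helper written for this illustration (it is not part of qdap), and the sample name and text are made up. It assumes the names still start and end with a word character, so that `\\b` can match at both ends.

```r
# Hypothetical helper: put a backslash before every regex metacharacter.
escape_regex <- function(x) {
  gsub("([.\\\\|()\\[\\]{}^$*+?-])", "\\\\\\1", x, perl = TRUE)
}

wrong <- "J.R. Smith"   # contains dots, which are regex wildcards
right <- "John Smith"

pattern <- paste0("\\b", escape_regex(wrong), "\\b")
result <- gsub(pattern, right, "Ask J.R. Smith or JxRx Smith", perl = TRUE)
result
```

Without the escaping, the unescaped dots would also match the `x` characters in `JxRx Smith`.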

Wiktor Stribiżew

You can do this with mapply.

mapply(function(x, y){
  gsub(paste0("\\<",x, "\\>"), y, x)
}, list1, list2)

   Name1    Name2    Name3    Name4    Name5    Name6    Name7    Name8    Name9   Name10 
 "test1"  "test2"  "test3"  "test4"  "test5"  "test6"  "test7"  "test8"  "test9" "test10" 

Wrap unname() around it to get rid of the names.
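
The mapply call above replaces each entry of list1 within itself. If every cell of the real data holds exactly one name, a plain lookup with match() may be enough and avoids regular expressions altogether. This is only a sketch of that alternative, not part of the answer; `names_col` is a hypothetical data column and the lists are shortened for illustration.

```r
list1 <- c("Name1", "Name2", "Name10")
list2 <- c("test1", "test2", "test10")

# Hypothetical column of raw names; "Other" has no entry in list1.
names_col <- c("Name10", "Other", "Name1", "Name1")

# Position of each value in list1 (NA when the value is not listed).
idx <- match(names_col, list1)

# Replace listed values and keep everything else unchanged.
recoded <- ifelse(is.na(idx), names_col, list2[idx])
recoded
```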

LAP