
I am using str_replace() to correct thousands of misspelled city names. I'd like to align the replacement argument (i.e. the last argument) so that it starts at the same column on every line. For example, I'd like to go from:

data %>% 
  mutate(
    city_name = str_replace(city_name, "Torunto", "Toronto"),
    city_name = str_replace(city_name, "Edmoonton", "Edmonton"),
    city_name = str_replace(city_name, "Saskatchawan", "Saskatchewan")
) 

To this:

data %>% 
  mutate(
    city_name = str_replace(city_name, "Torunto",      "Toronto"),
    city_name = str_replace(city_name, "Edmoonton",    "Edmonton"),
    city_name = str_replace(city_name, "Saskatchawan", "Saskatchewan")
) 

Is there an RStudio feature that lets me do this easily? So far, I have tried a regex search and replace and RStudio's 'find and add next' feature, but to no avail.

Josh Persi
  • Are you going to have 1000s of rows in mutate? The answer below suggests creating a lookup table instead. – zx8754 Mar 30 '22 at 20:43
  • Another solution would be fuzzy matching: make a data frame of the correct city names, then [fuzzy match](https://stackoverflow.com/q/26405895/680068) against it. – zx8754 Mar 30 '22 at 20:46
  • Thanks for the suggestion - I hadn't heard of fuzzy matching until now. My concern with using lookup tables is I would inevitably make an error in the order of city names. I suppose the easiest way to reduce the risk is to keep each vector on a separate script and use the line numbers as a guide. – Josh Persi Mar 30 '22 at 20:58
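One way to ease the ordering concern with a lookup table is to write it row by row, so each misspelling sits on the same line as its correction. The sketch below assumes `tibble::tribble()` for the table; the names `corrections`, `wrong`, and `right` and the toy `data` are hypothetical. Unlike str_replace(), the join matches whole values exactly rather than substrings.

```r
library(dplyr)
library(tibble)

# Hypothetical lookup table: one row per misspelling, so each pair stays together
# and the columns line up visually.
corrections <- tribble(
  ~wrong,         ~right,
  "Torunto",      "Toronto",
  "Edmoonton",    "Edmonton",
  "Saskatchawan", "Saskatchewan"
)

# Toy data standing in for the real `data`.
data <- tibble(city_name = c("Torunto", "Edmoonton", "Regina", "Saskatchawan"))

cleaned <- data %>%
  left_join(corrections, by = c("city_name" = "wrong")) %>%
  # Use the correction where one exists, otherwise keep the original name.
  mutate(city_name = coalesce(right, city_name)) %>%
  select(-right)
```

Names with no entry in the table (here "Regina") pass through unchanged.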

1 Answer


str_replace_all() can take a named vector, where the names are the patterns to find and the values are their replacements:

library(dplyr)
library(stringr)
# setNames(values, names): names are the misspelled patterns, values the replacements
data %>%
    mutate(city_name = str_replace_all(city_name,
     setNames(c("Toronto", "Edmonton", "Saskatchewan"),
              c("Torunto", "Edmoonton", "Saskatchawan"))))
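The same named vector can also be written with `name = value` pairs, which keeps each pattern on the same line as its replacement and lets the replacements be aligned in a column. This is a minimal sketch; the vector name `fixes` and the toy `city_name` values are illustrative, not part of the original data. Note that the names are interpreted as regular expressions.

```r
library(stringr)

# Each pattern sits next to its replacement, so the pairs cannot drift apart.
fixes <- c(
  "Torunto"      = "Toronto",
  "Edmoonton"    = "Edmonton",
  "Saskatchawan" = "Saskatchewan"
)

# Toy input standing in for the real column.
city_name <- c("Torunto", "Edmoonton", "Regina", "Saskatchawan")

result <- str_replace_all(city_name, fixes)
```

Inside a pipeline this would become `mutate(city_name = str_replace_all(city_name, fixes))`.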
akrun
  • Thanks Akrun, that's an appealing solution. However, I'm not confident I'd be able to keep the two vectors in order, since I have thousands of very similarly misspelled names (e.g. Dauglas and Daugles instead of Douglas). If there's some way to reduce the manual effort of keeping the two vectors in a consistent order, I'd be keen to hear it. – Josh Persi Mar 30 '22 at 20:36
  • @JoshPersi There is another option in `fuzzyjoin` to match on stringdist or regex. It depends on the kind of dissimilarity. – akrun Mar 30 '22 at 21:28
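To illustrate the idea behind matching on string distance, here is a rough sketch that uses base R's `adist()` rather than `fuzzyjoin` itself. It assumes a reference vector of correct names is available; `correct`, `observed`, and the cutoff `max_dist` are hypothetical, and a suitable cutoff would depend on how the real names differ.

```r
# Known-good names (assumed to be available as a reference list).
correct  <- c("Toronto", "Edmonton", "Saskatchewan", "Douglas")
observed <- c("Torunto", "Edmoonton", "Saskatchawan", "Dauglas", "Daugles")

# Edit-distance matrix: rows are observed names, columns are correct names.
d <- adist(observed, correct)

# For each observed name, the index of the closest correct name and its distance.
nearest <- apply(d, 1, which.min)
best    <- d[cbind(seq_along(observed), nearest)]

# Hypothetical cutoff: accept a match only if it is within 2 edits.
max_dist  <- 2
suggested <- ifelse(best <= max_dist, correct[nearest], observed)
```

With thousands of similar names it is probably worth reviewing the suggested pairs by eye before applying them, since two different real cities can also be within a couple of edits of each other.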