1

I am working on a column in a data set. This column has over 500K rows, with 1500 unique entries. These entries describe different cities, but many are misspelled.

What I want to do is identify all the rows that are similar to each other (e.g. New York City vs. New York Vs. New Yark) and then change the entire content of this row to "New York City".

Are there commands that can identify similar strings and then replace their contents?

Karolis Koncevičius
  • 9,417
  • 9
  • 56
  • 89
Jack
  • 13
  • 2
  • 3
    It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Mar 30 '20 at 19:36
  • Possible duplicate of something like this: https://stackoverflow.com/questions/25752306/r-function-to-correct-words-by-frequency-of-more-proximate-word – MrFlick Mar 30 '20 at 19:37
  • General recommendations here: https://stackoverflow.com/questions/3515235/find-closest-match-for-misspelled-city-names – MrFlick Mar 30 '20 at 19:38

1 Answers1

0

You can try using approximate "grep" - agrep():

Start with some example cities:

cities <- c("New Yark", "New York City", "Nevada", "California", "new york", "New Amsterdam", "Naw York")

Get names that approximately match "New York" (you will have to play with max.dist parameter)

inds <- agrep("New York", cities, max.dist=0.2)

Here are the ones that matched in this case:

cities[inds]
[1] "New Yark"      "New York City" "new york"      "Naw York"

Let's change them:

cities[inds] <- "New York"
cities
[1] "New York"      "New York"      "Nevada"        "California"    "New York"      "New Amsterdam" "New York"
Karolis Koncevičius
  • 9,417
  • 9
  • 56
  • 89