
I have a city column of 25,000 rows in a data frame, and many of the city names are misspelled. A sample looks like this:

Vishakapatnam, a.p
Vishakapatnam URBAN
Vishakapatnam Distt.
Vishakapatnam
Vishakapatnam
vghjfg"
vgfsgsvsw
Vellore
Vellore
VELLORE
VELLORE
New deklhi
New Dehli
new dehli
NEW DEHI
xxxx

zz
a
1234
5644
3

The data contains city names with different spellings, along with numbers, blank entries, and random strings of letters. I want to map the misspelled variants of each city to a single name and remove the blanks, numbers, and meaningless strings. I am trying to do this with grep, as suggested in some of the answers here, but it is very tedious. I also tried the tm package but could not get it to work. Could someone please share a more efficient way to do this?

ssan
  • Do you have a list / data set defining *correctly* spelled city names? – nrussell Jul 22 '16 at 11:53
  • What are the criteria for "alphabets with no meaning"? – hrbrmstr Jul 22 '16 at 11:54
  • You can remove spaces with `gsub` (e.g. `gsub(" ", "", "New Dehli")`). To identify misspelled words I would calculate the Jaccard index (https://en.wikipedia.org/wiki/Jaccard_index) or a similar measure. – Qaswed Jul 22 '16 at 12:05
  • There is no data that gives the correct city names; I have to create it manually. And the stray letters are random; they do not follow any criteria. – ssan Jul 22 '16 at 12:08
  • Have a look at ``maps::world.cities``, which provides a list of 44000 cities. A quick look at it shows that at least *Visakhapatnam* is included. As a start you could try ``which(cities %in% maps::world.cities$name)``. Maybe that helps a bit. – Phann Jul 22 '16 at 12:28
  • Adding to the suggestion by @Qaswed, you may want to look into "fuzzy string matching": http://bigdata-doctor.com/fuzzy-string-matching-survival-skill-tackle-unstructured-information-r/ – Jean V. Adams Jul 22 '16 at 13:22
  • Building on @Phann's resource, take a look at this SO question: http://stackoverflow.com/questions/29360262/find-match-of-two-data-frames-and-rewrite-the-answer-as-data-frame. You could perhaps use its methodology to match your data frame of names to the world list. – lawyeR Jul 22 '16 at 16:24
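
The suggestions in the comments (strip noise with `gsub`, then fuzzy-match against a reference list) could be combined roughly as follows. This is only a sketch under assumptions: `canonical` is a hypothetical, hand-made list of correct names (the question says none exists yet, so `maps::world.cities$name` might be a starting point), `normalise()` and `match_city()` are illustrative helper names, and the 0.25 threshold is a guess that would need tuning on the real data. It uses edit distance via base R's `adist()` rather than the Jaccard index mentioned above.

```r
# Sample values taken from the question.
cities <- c("Vishakapatnam, a.p", "Vishakapatnam URBAN", "Vishakapatnam Distt.",
            "VELLORE", "New deklhi", "New Dehli", "NEW DEHI",
            "vgfsgsvsw", "xxxx", "zz", "1234", "")

# Assumption: a small reference list of correct names, created by hand.
canonical <- c("Visakhapatnam", "Vellore", "New Delhi")

# Lower-case, replace digits and punctuation with spaces, collapse whitespace.
normalise <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z ]", " ", x)
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

# Return the closest canonical name for each input, or NA when nothing is
# close enough. max_rel_dist is the allowed edit distance as a fraction of
# the canonical name's length (an assumed value, not a recommendation).
match_city <- function(x, canonical, max_rel_dist = 0.25) {
  key <- normalise(x)
  ref <- normalise(canonical)
  out <- rep(NA_character_, length(key))
  ok  <- which(nzchar(key))              # purely numeric or blank rows stay NA
  if (length(ok) == 0) return(out)

  # partial = TRUE measures the distance from each canonical name to its
  # best-matching substring of the input, so trailing words such as
  # "urban" or "distt" do not count against the match.
  d   <- adist(ref, key[ok], partial = TRUE)   # rows: canonical, cols: inputs
  rel <- d / nchar(ref)                        # divides row i by nchar(ref[i])

  out[ok] <- vapply(seq_along(ok), function(j) {
    i <- which.min(rel[, j])
    if (length(i) == 1 && rel[i, j] <= max_rel_dist) canonical[i] else NA_character_
  }, character(1))
  out
}

cleaned <- match_city(cities, canonical)
result  <- data.frame(original = cities, city = cleaned, stringsAsFactors = FALSE)
```

Rows where `city` is `NA` did not come close to any reference name and can be dropped with `result[!is.na(result$city), ]`, or reviewed by hand to extend `canonical`. For 25,000 rows it may be cheaper to run the matching on `unique(cities)` and join the result back.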

0 Answers