3

I have a column in a dataset with postcode and cities. I need to clean it up, and leave only the name of the city. The problem is that the postcode comes sometimes before, sometimes after the name of the city:

52064 Aachen 
1000 EA Amsterdam 
6411 EJ Heerlen 
Johannesburg 
Dublin 2 
3600 AA Maarssen 
75591 Paris Cedex 12 
7302 HA Apeldoorn

I'd need to have it cleaned into

Aachen 
Amsterdam 
Heerlen 
Johannesburg 
Dublin
Maarssen 
Paris 
Apeldoorn

Anybody knows how to go about it?

Bond
  • 16,071
  • 6
  • 30
  • 53
  • 1
    `gsub("[0-9]", "", yourData)` to get rid of digits – Rorschach Jul 29 '15 at 21:27
  • What would you do about `75591 Paris Cedex 12`? You probably would want to establish some regex rules or have a database of city names handy – Rich Scriven Jul 29 '15 at 21:29
  • 4
    You may use `library(maps)` to get `world.cities$name` and use that to `grep` or `gsub` to extract the city names. – akrun Jul 29 '15 at 21:33
  • @akrun, thought about that too. It'll work for larger cities, there are about 43k cities in the list. What if there's a village? – Serban Tanasa Jul 29 '15 at 21:39
  • @SerbanTanasa The OP's description is `column in a dataset with postcode and cities`. If you are talking about the `world.cities`, I haven't checked it. – akrun Jul 29 '15 at 21:41
  • @SerbanTanasa If you already have a working example, you can post it as I didn't tried it. – akrun Jul 29 '15 at 21:42

1 Answers1

1

One way is to use gsub, with a souped-up version of the code below:

gsub("^ *| *$","",gsub("[0-9]|[A-Z]{2}|Cedex","",mydata))

[1] "Aachen"       "Amsterdam"    "Heerlen"      "Johannesburg"
 "Dublin"       "Maarssen"     "Paris"        "Apeldoorn"

In English, I'm asking it to first remove digits [0-9] then adding an OR condition | then asking it to take out any instances of two capital letters occuring one after another, another or, then removing specific postal markers like the word Cedex. I'm wrapping it in another gsub to take care of any initial ^ or final $ regular spaces.

Alternatively, you could try @akrun's suggestions of using library(maps) to get world.cities$name (forty-three thousand localities) and pulling some vectorized regex against it, but I'm having trouble with duplicates, i.e. 'York' vs 'New York' in my toy examples.

world.cities$name[(unlist(lapply(world.cities$name, grepl, "52064 Aachen ")))]
[1] "A" "Aachen"
Serban Tanasa
  • 3,592
  • 2
  • 23
  • 45
  • I didn't test it as I was following Richard Scriven's comments. Looks like [this](http://stackoverflow.com/questions/11757013/regular-expressions-for-city-name) may be an option. – akrun Jul 29 '15 at 22:07
  • Thanks Serban, that worked! And thanks for the explanation. – Pavel Kirjanas Jul 30 '15 at 09:39