I have a column of city names, e.g. London, LONDON, NEW YORK, New York, etc.

But I also have values such as <c3><U+119B>London and M<c3><U+1193>New York and, to make things a little more complicated, rows with values such as London<c3><U+1193>OL, Sydney<c3><U+0087>NL, London(Westminster) and Alicante/ALACANT. The data set also contains names with Spanish accents, e.g. Coloma de Cervellò, La Riera de Gaià, Sant Vicen <c3><U+0087> Dels Horts.

So I am just trying to clean this one column.

Can somebody point me in the right direction on how to remove parts of the values, for example:

<c3><U+119B>London       to        London
Sydney<c3><U+0087>NL     to        Sydney

Thanks in advance

skr
user113156
  • Have you tried `?gsub` – juan Jul 19 '17 at 16:07
  • Also, check out the `stringi` package for converting non-ASCII characters, eg, as in [this answer](https://stackoverflow.com/a/37254905/5037901). – juan Jul 19 '17 at 16:20

2 Answers

If you have a list of all the cities you expect to find in your dataset, I would do something like this:

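goodNames <- c("London", "Alicante", "Sydney")
badNames <- c("London(Westminster)", "Alicante/ALACANT", "SydneyNL")
newNames <- badNames  # start from the raw values and overwrite the matches

# seq_along() is safer than 1:length() if goodNames is ever empty
for (i in seq_along(goodNames)) {
    # grepl() is TRUE wherever the good name occurs inside a bad name
    newNames[grepl(goodNames[i], badNames)] <- goodNames[i]
}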

This loops through every good city name in the goodNames vector and checks whether that name can be found within each bad name (e.g. "Sydney" appears within "SydneyNL"). If it does, the bad name is replaced with the good name. Check out the grep() documentation; there are a lot of useful options, such as whether or not the matching should be case sensitive.
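As a rough sketch of the case-sensitivity option, the same loop can be run with ignore.case = TRUE so that "LONDON" and "new york" are matched as well. The vector rawNames and its values are made up for illustration and are not taken from the question's data. Note that the good names are treated as regular expressions here, so a name containing characters like "(" or "." would need escaping first.

```r
goodNames <- c("London", "New York", "Sydney")
# Hypothetical raw values, invented for this example
rawNames <- c("LONDON", "new york", "Sydney<c3><U+0087>NL", "Paris")

cleaned <- rawNames
for (name in goodNames) {
  # TRUE wherever the good name occurs in the raw value, ignoring case
  hit <- grepl(name, rawNames, ignore.case = TRUE)
  cleaned[hit] <- name
}
print(cleaned)
```

Values that match none of the good names (here "Paris") are left unchanged, which makes it easy to list what still needs attention afterwards.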

If you don't have a list of cities with their proper spelling, then you're probably in for a lot of fiddling. Read up on grep(), and the related functions listed in the grep() documentation. If you find it too confusing, the most manual and straightforward approach would be something like this:

df <- data.frame(city=badNames, stringsAsFactors= FALSE)

df$city[df$city == "SydneyNL"] <- "Sydney"
df$city[df$city == "London(Westminster)"] <- "London"
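If the number of manual corrections grows, one possible way to keep them manageable is a named lookup vector instead of one assignment per value. This is only a sketch: the names lookup and hit and the extra value "Madrid" are assumptions made for the example.

```r
# Names are the bad spellings, values are the corrected ones (illustrative only)
lookup <- c("SydneyNL" = "Sydney",
            "London(Westminster)" = "London",
            "Alicante/ALACANT" = "Alicante")

df <- data.frame(city = c("London(Westminster)", "Alicante/ALACANT",
                          "SydneyNL", "Madrid"),
                 stringsAsFactors = FALSE)

# Only touch rows whose value has an entry in the lookup
hit <- df$city %in% names(lookup)
df$city[hit] <- unname(lookup[df$city[hit]])
print(df$city)
```

Rows without an entry in the lookup are left as they are.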
20salmon

Here is one way to think about doing it:

bad <- "<c3><U+119B>London"
# Delete every "<...>" run; [^>]* stops at the first ">", so each tag is matched separately
good <- gsub("<[^>]*>", "", bad)
good
[1] "London"

This removes every run of characters enclosed in <>, including the < and > themselves.
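Stripping the tags alone would still leave "SydneyNL" for the question's second example. One heuristic, offered only as a sketch, is to split each value at the tags and keep the longest remaining piece, on the assumption that the city name is longer than the stray code next to it. The function name keep_longest_piece is made up here, and the assumption does not hold for every value (a name with a tag in the middle, such as "Sant Vicen <c3><U+0087> Dels Horts", would be cut short), so the result should be checked by hand.

```r
keep_longest_piece <- function(x) {
  # Split at each run of one or more "<...>" tags
  pieces <- strsplit(x, "(<[^>]*>)+")
  vapply(pieces, function(p) {
    p <- trimws(p)
    # Empty or missing input: nothing sensible to return
    if (length(p) == 0 || anyNA(p)) return(NA_character_)
    # Assumption: the longest piece is the city name
    p[which.max(nchar(p))]
  }, character(1))
}

x <- c("<c3><U+119B>London", "Sydney<c3><U+0087>NL",
       "M<c3><U+1193>New York", "London")
result <- keep_longest_piece(x)
print(result)
```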

roarkz