I'm trying to fix escaped characters in a large vector of strings. The escapes match the pattern <U\\+[0-9a-fA-F]{4}> (e.g. S<U+00E3>). The vector has 358,626 elements, and I provide a random sample of 100 values below.
Expected result:
"Thessalon<U+00ED>ki Thessaloniki Greece" => "Thessaloníki Thessaloniki Greece"
"Phoenix Arizona United States" => "Phoenix Arizona United States"
"" => ""
NA => NA
Luckily, @MrFlick has devised a nice function called trueunicode for converting those back to "normal" characters. However, trueunicode fails for strings that don't contain the pattern. I've tried to work around this by applying trueunicode only to values containing the pattern, like so:
sapply(addresses, function(x) ifelse(grepl("<U\\+[0-9a-fA-F]{4}>", x), trueunicode(x), x))
Unfortunately, for some reason, trueunicode still fails somewhere, returning:
Error in (function (cp) : too many bits
addresses <- c("San Francisco California United States", "Encinitas CA United States",
"Malvern Pennsylvania United States", "New York NY United States",
"Temecula CA United States", "San Francisco CA United States",
"Pittsburgh Pennsylvania United States", "Istanbul Turkey", "Rochester New York United States",
"Atlanta GA United States", "Cochin Kerala India", "Sydney New South Wales Australia",
"Los Angeles CA United States", "Vancouver British Columbia Canada",
"Rio De Janeiro Rio de Janeiro Brazil", "Washington District of Columbia United States",
"Seattle Washington United States", "Phoenix Arizona United States",
"Kwun Tong Kowloon Hong Kong", "Milwaukee Wisconsin United States",
"Dublin Dublin Ireland", "London England United Kingdom", "Broomfield Colorado United States",
"Bandung Indonesia", "London England United Kingdom", "Washington United States",
"Ramat Gan Tel Aviv Israel", "Sydney New South Wales Australia",
"Houston TX United States", "Salida CO United States", "Bethesda Maryland United States",
"San Jose California United States", "S<U+00E3>o Gon<U+00E7>alo Rio de Janeiro Brazil",
"Richmond Virginia United States", "Davao City Davao City Philippines",
"Bucharest Bucuresti Romania", "Providencia Chile", "Cape Coral Florida United States",
"Glenrothes Fife United Kingdom", "New York New York United States",
"Brooklyn NY United States", "New York New York United States",
"Vienna Wien Austria", "Addison TX United States", "Tel Aviv Tel Aviv Israel",
"Hilton New York United States", "Tiangu<U+00E1> Ceara Brazil",
"Hamburg Hamburg Germany", "Thessalon<U+00ED>ki Thessaloniki Greece",
"New York New York United States", "Vancouver British Columbia Canada",
"Lagos Lagos Nigeria", "Karachi Sindh Pakistan", "Santa Barbara CA United States",
"Mumbai Maharashtra India", "Burlington Massachusetts United States",
"Oslo Oslo Norway", "Jakarta Jakarta Raya Indonesia", "Madrid Madrid Spain",
"Singapore", "San Mateo California United States", "St. Petersburg Florida United States",
"Cincinnati Ohio United States", "San Francisco CA United States",
"Gaithersburg Maryland United States", "Watford Hertford United Kingdom",
"Austin Texas United States", "Gent Oost-Vlaanderen Belgium",
"Canton Massachusetts United States", "Berkeley California United States",
"Carlsbad California United States", "St. Petersburg Florida United States",
"Bangalore Karnataka India", "Nyon Vaud Switzerland", "Arlington Virginia United States",
"Palo Alto California United States", "London England United Kingdom",
"Sydney New South Wales Australia", "Mumbai Maharashtra India",
"Austin Texas United States", "Larnaca Cyprus", "Melbourn Cambridgeshire United Kingdom",
"Chicago Illinois United States", "Houston Texas United States",
"Paris France", "New York New York United States", "Auburn Hills Michigan United States",
"New Delhi Delhi India", "Bangalore Karnataka India", "Redwood City California United States",
"Mississauga Ontario Canada", "New York New York United States",
"Sydney New South Wales Australia", "St Louis MO United States",
"Rotterdam The Netherlands", "Delta British Columbia Canada",
"Erlangen Bayern Germany", "Ashburn Virginia United States",
"Pasadena California United States", "Palo Alto CA United States"
)
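For reference, the pattern-only replacement described above can be sketched in base R without trueunicode. This is only a rough illustration of the behaviour I'm after, not @MrFlick's function: the helper name fix_unicode_escapes is made up, and it assumes every escape is exactly four hex digits naming a single code point.

```r
# Hypothetical helper (not trueunicode): replaces each <U+XXXX> escape with
# the character it encodes and leaves every other string, "" and NA untouched.
fix_unicode_escapes <- function(x) {
  pattern <- "<U\\+[0-9A-Fa-f]{4}>"
  # Collect the distinct escapes that occur anywhere in the vector;
  # elements without a match (including NA) contribute nothing.
  tokens <- unique(unlist(regmatches(x, gregexpr(pattern, x))))
  for (tok in tokens) {
    # Characters 4 to 7 of "<U+00ED>" are the hex digits "00ED"
    code <- strtoi(substr(tok, 4, 7), base = 16L)
    # Literal (fixed) replacement of the escape, vectorised over x
    x <- gsub(tok, intToUtf8(code), x, fixed = TRUE)
  }
  x
}

# The four cases from the expected result
fix_unicode_escapes(c("Thessalon<U+00ED>ki Thessaloniki Greece",
                      "Phoenix Arizona United States", "", NA))
```

Because it loops over the distinct escapes rather than over the strings, it should stay reasonably fast on the full vector as long as the number of different escapes is small.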
EDIT: I've removed mention of Unicode errors, following @PanagiotisKanavos's comment.