
I'm trying to fix mis-encoded characters in a large vector of strings. They show up as literal tokens matching the regex `<U\\+[0-9a-fA-F]{4}>` (e.g. `S<U+00E3>`). The vector is 358,626 elements long; a random sample of 100 values is included below.

Expected result:

"Thessalon<U+00ED>ki Thessaloniki Greece" => "Thessaloníki Thessaloniki Greece"
"Phoenix Arizona United States" => "Phoenix Arizona United States"
""  => ""
NA => NA

Luckily, @MrFlick has devised a nice function called trueunicode for converting those tokens back to "normal" characters. However, trueunicode fails for strings that don't contain the pattern. I've tried to work around this by applying trueunicode only to values containing the pattern, like so:

sapply(addresses, function(x) ifelse(grepl("<U\\+[0-9a-fA-F]{4}>", x), trueunicode(x), x))

Unfortunately, for some reason, trueunicode still fails somewhere, returning:

Error in (function (cp) : too many bits

addresses <- c("San Francisco California United States", "Encinitas CA United States", 
"Malvern Pennsylvania United States", "New York NY United States", 
"Temecula CA United States", "San Francisco CA United States", 
"Pittsburgh Pennsylvania United States", "Istanbul Turkey", "Rochester New York United States", 
"Atlanta GA United States", "Cochin Kerala India", "Sydney New South Wales Australia", 
"Los Angeles CA United States", "Vancouver British Columbia Canada", 
"Rio De Janeiro Rio de Janeiro Brazil", "Washington District of Columbia United States", 
"Seattle Washington United States", "Phoenix Arizona United States", 
"Kwun Tong Kowloon Hong Kong", "Milwaukee Wisconsin United States", 
"Dublin Dublin Ireland", "London England United Kingdom", "Broomfield Colorado United States", 
"Bandung Indonesia", "London England United Kingdom", "Washington United States", 
"Ramat Gan Tel Aviv Israel", "Sydney New South Wales Australia", 
"Houston TX United States", "Salida CO United States", "Bethesda Maryland United States", 
"San Jose California United States", "S<U+00E3>o Gon<U+00E7>alo Rio de Janeiro Brazil", 
"Richmond Virginia United States", "Davao City Davao City Philippines", 
"Bucharest Bucuresti Romania", "Providencia Chile", "Cape Coral Florida United States", 
"Glenrothes Fife United Kingdom", "New York New York United States", 
"Brooklyn NY United States", "New York New York United States", 
"Vienna Wien Austria", "Addison TX United States", "Tel Aviv Tel Aviv Israel", 
"Hilton New York United States", "Tiangu<U+00E1> Ceara Brazil", 
"Hamburg Hamburg Germany", "Thessalon<U+00ED>ki Thessaloniki Greece", 
"New York New York United States", "Vancouver British Columbia Canada", 
"Lagos Lagos Nigeria", "Karachi Sindh Pakistan", "Santa Barbara CA United States", 
"Mumbai Maharashtra India", "Burlington Massachusetts United States", 
"Oslo Oslo Norway", "Jakarta Jakarta Raya Indonesia", "Madrid Madrid Spain", 
"Singapore", "San Mateo California United States", "St. Petersburg Florida United States", 
"Cincinnati Ohio United States", "San Francisco CA United States", 
"Gaithersburg Maryland United States", "Watford Hertford United Kingdom", 
"Austin Texas United States", "Gent Oost-Vlaanderen Belgium", 
"Canton Massachusetts United States", "Berkeley California United States", 
"Carlsbad California United States", "St. Petersburg Florida United States", 
"Bangalore Karnataka India", "Nyon Vaud Switzerland", "Arlington Virginia United States", 
"Palo Alto California United States", "London England United Kingdom", 
"Sydney New South Wales Australia", "Mumbai Maharashtra India", 
"Austin Texas United States", "Larnaca Cyprus", "Melbourn Cambridgeshire United Kingdom", 
"Chicago Illinois United States", "Houston Texas United States", 
"Paris France", "New York New York United States", "Auburn Hills Michigan United States", 
"New Delhi Delhi India", "Bangalore Karnataka India", "Redwood City California United States", 
"Mississauga Ontario Canada", "New York New York United States", 
"Sydney New South Wales Australia", "St Louis MO United States", 
"Rotterdam The Netherlands", "Delta British Columbia Canada", 
"Erlangen Bayern Germany", "Ashburn Virginia United States", 
"Pasadena California United States", "Palo Alto CA United States"
)
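For reference, the decoding itself can be done without trueunicode. The sketch below (decode_u is a hypothetical name, not MrFlick's function) replaces each `<U+XXXX>` token with the character for that codepoint, and passes empty strings and NA through unchanged:

```r
# Minimal sketch of a <U+XXXX> decoder: find every token, parse its four hex
# digits, and substitute the corresponding Unicode character in place.
decode_u <- function(x) {
  vapply(x, function(s) {
    if (is.na(s)) return(NA_character_)          # NA stays NA
    m <- gregexpr("<U\\+[0-9a-fA-F]{4}>", s)     # locate all tokens
    regmatches(s, m) <- lapply(regmatches(s, m), function(tok) {
      if (length(tok) == 0) return(tok)          # no tokens: leave string as-is
      vapply(tok, function(t) {
        # t is like "<U+00E3>": hex digits sit at positions 4-7
        intToUtf8(strtoi(substr(t, 4, 7), base = 16L))
      }, character(1))
    })
    s
  }, character(1), USE.NAMES = FALSE)
}

decode_u(c("S<U+00E3>o Gon<U+00E7>alo Rio de Janeiro Brazil", "", NA))
```

This assumes every token carries exactly four hex digits, as in the sample data; codepoints outside that form would need a wider pattern.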

EDIT: I've removed mention of Unicode errors, following @PanagiotisKanavos's comment.

  • These aren't Unicode errors, they are plain-old-ASCII characters. `<U+00E3>` is a string with 8 ASCII characters, `<`,`U`,`+`,`0`,`0`,`E`,`3`,`>`. Your data source probably uses this encoding scheme to store anything that can't be represented as ASCII using the angle brackets, `U+` and a Unicode codepoint. Either export the text as actual Unicode text, or find out what kind of encoding was used and reverse it – Panagiotis Kanavos Jan 17 '17 at 10:04
  • Actually this encoding has nothing to do with Unicode at all. For example, the Greek city of Salonica is called Θεσσαλονίκη in Greek. And yet, your text contains `Thessalon<U+00ED>ki` which is *completely unrelated*. `00ED` is the Latin `í` letter. – Panagiotis Kanavos Jan 17 '17 at 10:08
  • Looking at `trueunicode` it seems that this function *breaks* Unicode using the completely incorrect assumption that Windows, a 100% Unicode OS somehow has problems with Unicode. In reality, it's R that doesn't handle Unicode consistently - some methods treat Unicode as wchar strings, others as plain-old char and depend on the system's locale to differentiate between ANSI and UTF8. Now, even the C++ standard doesn't have a UTF8 specific string, although it *does* have UTF16 and UTF32 types. R packages that don't use wchar or allow you to explicitly specify the codepage will fail to read UTF16 – Panagiotis Kanavos Jan 17 '17 at 10:14
  • @PanagiotisKanavos That is indeed the way it is spelt in the [source](https://www.crunchbase.com/organization/ergobyte-informatics-s-a#/entity), for some reason. Even if it might be an original typo in this particular instance, I need to reverse the encoding errors and it seems trueunicode does that, except that it sometimes breaks unaccountably. – syre Jan 17 '17 at 11:43
  • @PanagiotisKanavos Unfortunately, I can't export the original text again so I have to reverse the errors. – syre Jan 17 '17 at 11:47
  • You have to find the code for the converter then. I repeat, these aren't encoding errors. A program mapped the text to ANSI characters that kind-of-looked or **sounded** like the original. For example, the Greek Θ sounds like a soft `th`, as in Thatch. There is *no* such encoding, it's an unofficial phonetic spelling used decades ago when email didn't support Unicode. The Latin `í` on the other hand was *NEVER* used. Someone arbitrarily selected this to map `ί`. – Panagiotis Kanavos Jan 17 '17 at 11:54
  • Even the *Portuguese* characters are mapped, even though they *are* part of Latin 1! Looks like whoever wrote the encoder used 7-bit ASCII, probably encoding French and German characters too. – Panagiotis Kanavos Jan 17 '17 at 11:55
  • @PanagiotisKanavos Whether we call them encoding errors or not, I need to reverse my data to look like the original text. I'm sorry but I don't understand what is your recommendation. What do you mean by "the code for the converter"? These conversions occurred somewhere along the line when I imported the text to R from the Crunchbase API and then wrote them to a csv file in Windows. – syre Jan 17 '17 at 12:03
  • I don't speak `r` but I'd elaborate that too brief `stop("too many bits")`. Make public _raw_ input string, character position etc. – JosefZ Jan 18 '17 at 00:58

1 Answer


Calling the function one value at a time, instead of on the whole vector, did the job.

for (i in seq_along(addresses)) {
  y <- addresses[i]
  addresses[i] <- ifelse(grepl("<U\\+[0-9a-fA-F]{4}>", y), trueunicode(y), y)
}
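The same subset-assignment idea also works without an explicit loop, and it avoids ever passing pattern-free values to the converter. In this runnable sketch, `trueunicode` is a trivial stand-in (it only fixes `<U+00E3>`) so the pattern can be demonstrated; swap in the real function:

```r
# Stand-in for the real trueunicode, just so the sketch runs on its own.
trueunicode <- function(s) gsub("<U\\+00E3>", "\u00e3", s)

addresses <- c("S<U+00E3>o Paulo Brazil", "Phoenix Arizona United States", "", NA)

pat <- "<U\\+[0-9a-fA-F]{4}>"
hit <- !is.na(addresses) & grepl(pat, addresses)   # guard NA explicitly

# Overwrite only the matching elements, one value at a time via vapply.
addresses[hit] <- vapply(addresses[hit], trueunicode, character(1),
                         USE.NAMES = FALSE)
```

Because non-matching elements, empty strings, and NAs never reach the converter, this sidesteps whichever input was triggering the "too many bits" error.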

However, this is more of a workaround than a definitive solution. I also notice some weirdness in the results: the pattern "<U\\+[0-9a-fA-F]{4}>" is still visible in the data frame view in RStudio, although grepl no longer detects it.
