
In R, I have vectors like this:

TEST <- c("BlAA¶schl, G", "ThAA¶ni, A.")

where BlAA¶schl should be Blöschl, and ThAA¶ni should be Thöni.

There are similar problems throughout the whole dataset. I don't know the term for this problem (maybe "non-ASCII characters"?).

Based on this response, others seem to have used this code successfully:

Encoding(TEST) <- 'latin1'
stringi::stri_trans_general(TEST, 'Latin-ASCII')

But in my case, nothing changes.

What can I do to convert characters like AA¶ to ö?


EDIT: The key problem, it seems, is that there is a "double mojibake" as JosefZ mentioned in the comments.

EDIT 2: I found this "UTF-8 Character Debug Tool" which contains some (not all) of the problems in the actual and expected columns. In addition, this "encoding repairer" on GitHub seems to offer what I need, but it is not written in R.
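For reference, when the double mojibake is still intact (i.e. no bytes were dropped along the way, unlike the mangled `AA¶` sequences above), it can in principle be reversed by re-interpreting each string's latin1 bytes as UTF-8, twice. A minimal sketch, assuming the strings are valid UTF-8 in memory (the helper name `undo_mojibake` is illustrative, not a standard function):

```r
# Take the latin1 byte representation of each string and re-declare
# those same bytes as UTF-8; applying this twice undoes a double mojibake.
undo_mojibake <- function(x) {
  y <- iconv(x, from = "UTF-8", to = "latin1")
  Encoding(y) <- "UTF-8"  # same bytes, now read as UTF-8
  y
}

x <- "Bl\u00c3\u0083\u00c2\u00b6schl"  # "ö" after two rounds of mojibake
undo_mojibake(undo_mojibake(x))        # "Blöschl", if the bytes survived intact
```

This only works when every byte of the double-encoded text is still present; once a byte has been lost (as apparently happened in my data), the transformation is no longer invertible.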

anpami
  • this works fine on Linux with UTF-8... what is your encoding? check Sys.getenv() output? – user12256545 Mar 15 '21 at 15:28
  • @user12256545, interesting, thank you. I use Windows. Where I can find my "encoding" in `Sys.getenv()`? – anpami Mar 15 '21 at 15:36
  • You can't as it looks like a _double_ [mojibake](https://en.wikipedia.org/wiki/Mojibake) case (moreover, somewhat mangled), cf. the following _Python_ example: `'ö'.encode('utf-8').decode('latin1').encode('utf-8').decode('latin1')` returns `'Ã\x83¶'` – JosefZ Mar 15 '21 at 16:44
  • Thank you for the explanation, @JosefZ! But if it is impossible, how come the other user (in the comments here) was able to do it on Linux? – anpami Mar 15 '21 at 18:58
  • This works for me starting from `TEST <- c("Bl\xf6schl, G", "Thöni, A.")`. I'm on Windows where `Sys.getlocale(category = "LC_ALL")` returns `[1] "LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252"` and `stringi::stri_escape_unicode(TEST)` returns `[1] "Bl\\u00f6schl, G" "Th\\u00f6ni, A."` – JosefZ Mar 15 '21 at 22:18
  • I have the same result, @JosefZ, but it does not work with `BlAA¶schl` as a starting point. So I suppose it really does indicate a "double mojibake" that cannot be rescued? By the way, my `Sys.getlocale(category="LC_ALL")` returns: `LC_COLLATE=German_Austria.1252;LC_CTYPE=German_Austria.1252;LC_MONETARY=German_Austria.1252;LC_NUMERIC=C;LC_TIME=German_Austria.1252`. – anpami Mar 16 '21 at 08:59

1 Answer


There might be better, more efficient, and automated solutions, but here is the manual approach I took: I looked up every mojibake sequence in the dataset and replaced each one with gsub:

TEST <- c("BlAA¶schl, G", "ThAA¶ni, A.")

TEST <- gsub("ö", "ö", TEST)
TEST <- gsub("ü", "ü", TEST)
TEST <- gsub("ž", "z", TEST)
TEST <- gsub("á", "á", TEST)
TEST <- gsub("ä", "ä", TEST)
TEST <- gsub("ć", "ć", TEST)
TEST <- gsub("Ã", "Á", TEST)
TEST <- gsub("ß", "ß", TEST)
TEST <- gsub("ã", "ã", TEST)
TEST <- gsub("é", "é", TEST)
TEST <- gsub("Ä", "č", TEST)

It works, but with a large dataset there is always the risk of overlooking some characters.
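To make the repeated gsub calls easier to maintain and audit, the replacements could be kept in a single named vector (pattern = mojibake sequence, value = intended character) and applied in a loop. A sketch of that idea, assuming the illustrative names `fixes` and `fix_mojibake`:

```r
# One lookup table for all known mojibake sequences; extend as new
# broken sequences are discovered in the dataset.
fixes <- c("ö" = "ö", "ü" = "ü", "ä" = "ä",
           "ß" = "ß", "é" = "é")

fix_mojibake <- function(x) {
  for (pat in names(fixes)) {
    # fixed = TRUE: treat the pattern as a literal string, not a regex
    x <- gsub(pat, fixes[[pat]], x, fixed = TRUE)
  }
  x
}

fix_mojibake("Blöschl, G")  # "Blöschl, G"
```

Keeping the table in one place makes it easier to spot a missing character than scanning a long column of individual gsub lines.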

anpami