
In R, I have vectors like this:

TEST <- c("BlAA¶schl, G", "ThAA¶ni, A.")

where BlAA¶schl should be Blöschl, and ThAA¶ni should be Thöni.

There are similar problems throughout the whole dataset. I don't know the term for this problem (maybe "non-ASCII characters"?).

Based on this response, others seem to have used this code successfully:

Encoding(TEST) <- 'latin1'
stringi::stri_trans_general(TEST, 'Latin-ASCII')

But in my case, nothing changes.

What can I do to convert characters like AA¶ to ö?


EDIT: The key problem, it seems, is that there is a "double mojibake" as JosefZ mentioned in the comments.

EDIT 2: I found this "UTF-8 Character Debug Tool" which contains some (not all) of the problems in the actual and expected columns. In addition, this "encoding repairer" on GitHub seems to offer what I need, but it is not written in R.
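For reference, when the double mojibake is still intact (i.e. no bytes were dropped along the way, unlike the mangled `AA¶` sequences above), it can in principle be reversed by re-interpreting each string's latin1 bytes as UTF-8, twice. A minimal sketch, assuming the strings are valid UTF-8 in memory (the helper name `undo_mojibake` is illustrative, not a standard function):

```r
# Take the latin1 byte representation of each string and re-declare
# those same bytes as UTF-8; applying this twice undoes a double mojibake.
undo_mojibake <- function(x) {
  y <- iconv(x, from = "UTF-8", to = "latin1")
  Encoding(y) <- "UTF-8"  # same bytes, now read as UTF-8
  y
}

x <- "Bl\u00c3\u0083\u00c2\u00b6schl"  # "ö" after two rounds of mojibake
undo_mojibake(undo_mojibake(x))        # "Blöschl", if the bytes survived intact
```

This only works when every byte of the double-encoded text is still present; once a byte has been lost (as apparently happened in my data), the transformation is no longer invertible.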

anpami
  • this works fine on Linux with UTF-8... what is your encoding? check Sys.getenv() output? – user12256545 Mar 15 '21 at 15:28
  • @user12256545, interesting, thank you. I use Windows. Where I can find my "encoding" in `Sys.getenv()`? – anpami Mar 15 '21 at 15:36
  • You can't as it looks like a _double_ [mojibake](https://en.wikipedia.org/wiki/Mojibake) case (moreover, somewhat mangled), cf. the following _Python_ example: `'ö'.encode('utf-8').decode('latin1').encode('utf-8').decode('latin1')` returns `'Ã\x83¶'` – JosefZ Mar 15 '21 at 16:44
  • Thank you for the explanation, @JosefZ! But if it is impossible, how come the other user (in the comments here) was able to do it on Linux? – anpami Mar 15 '21 at 18:58
  • This works for me starting from `TEST <- c("Bl\xf6schl, G", "Thöni, A.")`. I'm on Windows where `Sys.getlocale(category = "LC_ALL")` returns `[1] "LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252"` and `stringi::stri_escape_unicode(TEST)` returns `[1] "Bl\\u00f6schl, G" "Th\\u00f6ni, A."` – JosefZ Mar 15 '21 at 22:18
  • I have the same result, @JosefZ, but it does not work with `BlAA¶schl` as a starting point. So I suppose it really does indicate a "double mojibake" that cannot be rescued? By the way, my `Sys.getlocale(category="LC_ALL")` returns: `LC_COLLATE=German_Austria.1252;LC_CTYPE=German_Austria.1252;LC_MONETARY=German_Austria.1252;LC_NUMERIC=C;LC_TIME=German_Austria.1252`. – anpami Mar 16 '21 at 08:59

1 Answer


There might be better, more efficient, and automated solutions, but here is the manual approach I took: I looked up every mojibake sequence in the dataset and replaced each one with gsub:

TEST <- c("BlAA¶schl, G", "ThAA¶ni, A.")

TEST <- gsub("ö", "ö", TEST)
TEST <- gsub("ü", "ü", TEST)
TEST <- gsub("ž", "z", TEST)
TEST <- gsub("á", "á", TEST)
TEST <- gsub("ä", "ä", TEST)
TEST <- gsub("ć", "ć", TEST)
TEST <- gsub("Ã", "Á", TEST)
TEST <- gsub("ß", "ß", TEST)
TEST <- gsub("ã", "ã", TEST)
TEST <- gsub("é", "é", TEST)
TEST <- gsub("Ä", "č", TEST)

It works, but with a large dataset there is always the risk of overlooking some characters.
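To make the repeated gsub calls easier to maintain and audit, the replacements could be kept in a single named vector (pattern = mojibake sequence, value = intended character) and applied in a loop. A sketch of that idea, assuming the illustrative names `fixes` and `fix_mojibake`:

```r
# One lookup table for all known mojibake sequences; extend as new
# broken sequences are discovered in the dataset.
fixes <- c("ö" = "ö", "ü" = "ü", "ä" = "ä",
           "ß" = "ß", "é" = "é")

fix_mojibake <- function(x) {
  for (pat in names(fixes)) {
    # fixed = TRUE: treat the pattern as a literal string, not a regex
    x <- gsub(pat, fixes[[pat]], x, fixed = TRUE)
  }
  x
}

fix_mojibake("Blöschl, G")  # "Blöschl, G"
```

Keeping the table in one place makes it easier to spot a missing character than scanning a long column of individual gsub lines.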

anpami