I, too, have been down the encoding rabbit hole, and one of the important things I learned is that "unknown"
encoding doesn't have to mean it's not UTF-8. Or bad. Or something that needs to be fixed.
Here are some examples:
# Some string that might be UTF-8 or just some ASCII (but created in UTF-8 editor/environment)
ambiguous <- "wat"
Encoding(ambiguous)
#> [1] "unknown"
# Forced coercion to UTF-8 via stringi
ambiguous <- stringi::stri_enc_toutf8("wat", is_unknown_8bit = TRUE)
# Still ambiguous
Encoding(ambiguous)
#> [1] "unknown"
# Some pretty-sure-not-ASCII string
totallygermanic <- "wät"
# It's UTF-8 because that's what my RStudio and every other part of my env is set to
Encoding(totallygermanic)
#> [1] "UTF-8"
# Let's force it to be unknown
Encoding(totallygermanic) <- "unknown"
# Still prints ok
totallygermanic
#> [1] "wät"
# What's its encoding now?
Encoding(totallygermanic)
#> [1] "unknown"
# Converting it to UTF-8 still prints ok
stringi::stri_enc_toutf8(totallygermanic)
#> [1] "wät"
# So the converted string is UTF-8, right? No.
Encoding(stringi::stri_enc_toutf8(totallygermanic))
#> [1] "unknown"
# Maybe we should just guess?
stringi::stri_enc_detect("wat")
#> [[1]]
#>     Encoding Language Confidence
#> 1 ISO-8859-1       en       0.75
#> 2 ISO-8859-2       ro       0.75
#> 3      UTF-8                0.15
stringi::stri_enc_detect("wät")
#> [[1]]
#>   Encoding Language Confidence
#> 1    UTF-8                 0.8
#> 2 UTF-16BE                 0.1
#> 3 UTF-16LE                 0.1
#> 4  GB18030       zh        0.1
#> 5   EUC-JP       ja        0.1
#> 6   EUC-KR       ko        0.1
#> 7     Big5       zh        0.1
Created on 2019-02-11 by the reprex package (v0.2.1)
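To make the point that the encoding mark is only a label, here is a minimal base R sketch (not part of the reprex above) that looks at the underlying bytes. It builds the string from a \u escape so the result should not depend on how the script file is saved; the variable names are just for illustration.

```r
# Build the string from a Unicode escape, so R stores it as UTF-8 and marks it accordingly
x <- "w\u00e4t"
bytes_before <- charToRaw(x)

# Drop the encoding mark; this changes the label only, not the bytes
Encoding(x) <- "unknown"
bytes_after <- charToRaw(x)

# The byte sequence is untouched by re-marking
identical(bytes_before, bytes_after)

# The bytes still form valid UTF-8, even though the mark now says "unknown"
validUTF8(x)
```

So an "unknown" mark by itself says nothing about whether the bytes are valid UTF-8.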
The takeaway is this: if your string is not obviously non-ASCII, e.g. it only contains the letters a-z, it could be ASCII or it could be UTF-8, so you get an unknown (R never marks pure-ASCII strings with an encoding, since their bytes are the same in every supported encoding). That doesn't have to mean your string is not actually UTF-8, apparently. You may try to forcibly coerce the string, but in the process you might break something that was not broken at all. In my experience, it may be perfectly adequate to use a conversion function like stringi::stri_enc_toutf8 on a variable/vector and then test whether it prints and works as expected, maybe using a regular expression filter for possibly problematic characters (as a native German speaker, I tend to look for äöüß).
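The convert-then-check approach could look roughly like this. It is only a sketch: the input vector is made up, and the character class is just the German set mentioned above, so adjust it to whatever characters matter in your data.

```r
library(stringi)

# Hypothetical input vector; \u escapes keep the example independent of the file's encoding
x <- c("wat", "w\u00e4t", "Stra\u00dfe")

# Convert to UTF-8; strings that are already UTF-8 pass through unchanged
x_utf8 <- stri_enc_toutf8(x)

# Check that every element is valid UTF-8 (pure ASCII also counts as valid UTF-8)
all_valid <- all(stri_enc_isutf8(x_utf8))

# Flag the elements that contain characters worth eyeballing after conversion
has_umlaut <- stri_detect_regex(x_utf8, "[\u00e4\u00f6\u00fc\u00df]")

# Print only the flagged elements to see whether they still look right
x_utf8[has_umlaut]
```

If the flagged elements print correctly, there is probably nothing left to fix.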
Anyway, if you want to dive into the nitty-gritty, I can recommend looking into the stringi package and its encoding functions. This package is the power behind stringr, which provides a more high-level interface.
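As a starting point, here is a small sketch of two of those stringi helpers. The input vector is made up for illustration; as far as I can tell, stri_enc_mark reports pure-ASCII strings separately instead of lumping them into "unknown" the way Encoding() does.

```r
library(stringi)

# Hypothetical input: one pure-ASCII string and one with an umlaut
x <- c("wat", "w\u00e4t")

# Declared encoding as stringi sees it
marks <- stri_enc_mark(x)

# Byte-level check: does the string consist of ASCII bytes only?
is_ascii <- stri_enc_isascii(x)

# Show both results side by side
data.frame(string = x, mark = marks, ascii = is_ascii)
```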