
I'm not really sure how to make this into a reproducible example, and for that I apologize. But I have a data frame with a string column. When I run stri_enc_mark on the column, I see that I have both 'ASCII' and 'UTF-8' encoded strings. This is an issue because when I try to upload this data into an Elasticsearch database, I run into the following error:

"Invalid UTF-8 start byte 0xa0\n at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@40d00701; line: 1, column: 1425]"

I'm assuming this is because of the ASCII-encoded strings. I tried write.csv(..., fileEncoding = 'UTF-8'), but when I load that CSV back, the string column still has a mix of encodings. None of Encoding(x) <- 'UTF-8', stri_enc_toutf8, or stri_encode seems to help with the conversion.
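For reference, this is roughly the check I'm running; the data frame and column below are made up, with a non-breaking space standing in for whatever non-ASCII character is in my real data:

library(stringi)

# Toy stand-in for my real data frame: 'note' mixes a plain ASCII string
# with one containing a non-breaking space (a non-ASCII character)
df <- data.frame(
  id   = 1:2,
  note = c("hello world", "price\u00a0100"),
  stringsAsFactors = FALSE
)

stri_enc_mark(df$note)
#> [1] "ASCII" "UTF-8"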

Any advice or guidance would be awesome.

  • ASCII characters are a subset of UTF-8 characters, so it's unlikely those are causing a problem. What code are you running exactly that gives the error? Without some sort of [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) it's going to be nearly impossible to help. – MrFlick Jun 13 '18 at 18:19
  • I don't know how to reproduce text encodings. Even if I make a variable `x <- 'hello world'`, the encoding turns out to be 'unknown', and when I try `iconv(x, 'unknown', 'UTF-8')` or even `stri_enc_toutf8(x)`, nothing changes. – struggles Jun 13 '18 at 18:36
  • The encoding will be unknown because there are no non-ASCII characters in there, so the encoding doesn't really matter. A function like `charToRaw()` can output the raw bytes of a string (see the short sketch just below these comments). – MrFlick Jun 13 '18 at 18:38
  • That actually solves the issue. Thank you so much! @MrFlick!!!! – struggles Jun 13 '18 at 19:11
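
A minimal illustration of the `charToRaw()` / `rawToChar()` round trip mentioned above; the example string is made up:

x <- "caf\u00e9"              # contains a non-ASCII character, so R marks it UTF-8
Encoding(x)
#> [1] "UTF-8"

charToRaw(x)                  # the raw bytes behind the string
#> [1] 63 61 66 c3 a9

y <- rawToChar(charToRaw(x))  # same bytes, but the declared encoding mark is gone
Encoding(y)
#> [1] "unknown"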

1 Answer


Thanks to @MrFlick, I was able to solve the problem. Essentially, given a data frame with character columns of mixed encodings, the easiest workaround was:

library(dplyr)

# Round-trip every character column through its raw bytes; this strips the
# declared encoding marks so each string is treated as native encoding
df <- df %>%
  mutate_if(is.character, function(x) {
    x %>%
      sapply(function(y) {
        y %>%
          charToRaw %>%   # string -> raw bytes
          rawToChar       # raw bytes -> string with no encoding mark
      })
  })

Passing each string through charToRaw() and back through rawToChar() strips the declared encoding marks, so every string is treated as being in the same native encoding. This resolved the error I was hitting when loading the data into Elasticsearch due to encoding inconsistencies.
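
As a quick check on a made-up vector, the declared marks are gone after the round trip while the bytes themselves are untouched:

notes <- c("hello world", "price\u00a0100")   # hypothetical mixed column
Encoding(notes)
#> [1] "unknown" "UTF-8"

cleaned <- sapply(notes, function(y) rawToChar(charToRaw(y)), USE.NAMES = FALSE)
Encoding(cleaned)
#> [1] "unknown" "unknown"

identical(charToRaw(notes[2]), charToRaw(cleaned[2]))   # bytes are unchanged
#> [1] TRUE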
