
I have a web response being returned as raw bytes which I'm unable to decode properly. It contains the following values:

ef bc 86

The character is meant to be a Fullwidth Ampersand (to illustrate below):

> as.character("\uFF06")
[1] "＆"
> charToRaw("\uFF06")
[1] ef bc 86

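However, no matter what I've tried, the bytes get converted to mojibake instead of the intended character. To illustrate with the Fullwidth Quotation Mark (U+FF02), which is affected in the same way:

> rawToChar(charToRaw("\uFF02"))
[1] "ï¼‚"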

Because of the equivalence of the raw values, I don't think there's anything I can do in my web call to influence the problem I'm having (happy to be corrected). I believe I need to work out how to properly do the character encoding.

I also took the extreme approach of trying every encoding that stringi lists, as follows, but none produced the expected fullwidth character:

> x_raw <- charToRaw("\uFF02")
> x_raw
[1] ef bc 82
> sapply(
+     stringi::stri_enc_list()
+     ,function(encoding) stringi::stri_encode(str = x_raw, encoding)
+ ) |> # R's new native pipe
+     tibble::enframe(name = "encoding") 
# A tibble: 1,203 x 2
   encoding value          
   <chr>    <chr>          
 1 037      "Õ¯b"          
 2 273      "Õ¯b"          
 3 277      "Õ¯b"          
 4 278      "Õ¯b"          
 5 280      "Õ¯b"          
 6 284      "Õ¯b"          
 7 285      "Õ~b"          
 8 297      "Õ¯b"          
 9 420      "\u001a\u001ab"
10 424      "\u001a\u001ab"
# ... with 1,193 more rows

My workaround at the moment is to replace the strings after the conversion, but this character is just one example of many, and hard-coding every instance doesn't seem practical.

> rawToChar(x_raw)
[1] "ï¼‚"
> stringr::str_replace_all(rawToChar(x_raw), c("ï¼‚" = "\uFF02"))
[1] "＂"

The substitution workaround is further complicated by characters like the HYPHEN (not HYPHEN-MINUS), where the last two raw values get converted to what appear to be octal escapes in the string:

> as.character("\u2010") # HYPHEN
[1] "‐"
> as.character("\u2010") |> charToRaw() # As raw
[1] e2 80 90
> as.character("\u2010") |> charToRaw() |> rawToChar() # Converted back to string
[1] "â€\u0090"
> charToRaw("â\200\220") # string with equivalent raw
[1] e2 80 90

Any help appreciated.

1 Answer


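I'm not totally clear on exactly what you are trying to do, but the problem with getting back your original character is that R cannot determine the encoding automatically from the raw bytes. I assume you are on Windows, where R's native encoding is typically not UTF-8. If you run the following, the string starts out with an unknown encoding and prints correctly once it is marked as UTF-8: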

val <- rawToChar(charToRaw("\uFF06"))
val
# [1] "ï¼†"
Encoding(val)
# [1] "unknown"
Encoding(val) <- "UTF-8"
val
# [1] "＆"

Just make sure to set the encoding properly.
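
If you have many such values, the same two steps could be wrapped in a small helper. This is only a minimal sketch, assuming the web response really is UTF-8: `raw_to_utf8` is a hypothetical name (not a base R function), and the `validUTF8()` check is an extra safeguard I've added rather than something the conversion requires.

```r
# Hypothetical helper: decode raw bytes that are assumed to be UTF-8.
raw_to_utf8 <- function(bytes) {
  stopifnot(is.raw(bytes))
  s <- rawToChar(bytes)
  # Mark the string as UTF-8; this only sets the declared encoding,
  # the underlying bytes are left unchanged.
  Encoding(s) <- "UTF-8"
  # Fail loudly if the assumption about the input is wrong.
  if (!validUTF8(s)) stop("bytes are not valid UTF-8")
  s
}

amp    <- raw_to_utf8(as.raw(c(0xef, 0xbc, 0x86)))  # FULLWIDTH AMPERSAND
hyphen <- raw_to_utf8(as.raw(c(0xe2, 0x80, 0x90)))  # HYPHEN

# Compare against the code points rather than relying on console display.
utf8ToInt(amp)    == 0xFF06
utf8ToInt(hyphen) == 0x2010
```

Because only the encoding mark changes, this avoids any per-character substitution table. If the bytes turn out to be in some other encoding, `iconv()` with an explicit `from` would be the tool to reach for instead.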

MrFlick