Mixed Character Encodings in R: When a space isn't a space

Question

These two strings come from .csv files. They look identical, but they are not. This causes all sorts of problems when using dplyr functions like left_join and filter, or even base functions like merge.

Since they render in the same encoding on webpages like SO, here are the files in a gist.

library(tidyverse)
df1 <- read_csv("df1.csv")
df2 <- read_csv("df2.csv")

word1 <- pull(df1, word1)
word2 <- pull(df2, word2)

# word1 <- "KAIFENG PINGMEI NEW CARBON MATERIALS TECHNOLOGY CO., LTD."
# word2 <- "KAIFENG PINGMEI NEW CARBON MATERIALS TECHNOLOGY CO., LTD."
word1 == word2 # FALSE

dput(word1)
"KAIFENG PINGMEI NEW CARBON MATERIALS TECHNOLOGY CO., LTD."
dput(word2)
"KAIFENG PINGMEI NEW CARBON MATERIALS TECHNOLOGY CO., LTD."

Digging deeper, the spaces are encoded differently.

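# char1/char2 are not defined in the post; presumably each string split into single characters
char1 <- strsplit(word1, "")[[1]]
char2 <- strsplit(word2, "")[[1]]

tibble(char = char1) %>%
  mutate(diff = char1 == char2,
         enc_1 = Encoding(char1),
         enc_2 = Encoding(char2))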

# A tibble: 57 × 4
   char  diff  enc_1   enc_2  
   <chr> <lgl> <chr>   <chr>  
 1 K     TRUE  unknown unknown
 2 A     TRUE  unknown unknown
 3 I     TRUE  unknown unknown
 4 F     TRUE  unknown unknown
 5 E     TRUE  unknown unknown
 6 N     TRUE  unknown unknown
 7 G     TRUE  unknown unknown
 8       FALSE UTF-8   unknown
 9 P     TRUE  unknown unknown
10 I     TRUE  unknown unknown
# … with 47 more rows
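
Encoding() only reports the declared encoding mark, so it does not say which character sits in position 8. A minimal sketch of one way to check is to compare Unicode code points instead. The two strings below are hypothetical stand-ins, and the no-break space (U+00A0) is an assumption: it is a common cause of this symptom, but it is not confirmed for the files in the question.

```r
# Hypothetical stand-ins for word1 and word2 (not the real data):
# `b` has a no-break space (U+00A0) where `a` has an ordinary space (U+0020).
a <- "KAIFENG PINGMEI"
b <- "KAIFENG\u00a0PINGMEI"

# utf8ToInt() returns one integer code point per character.
cp_a <- utf8ToInt(a)
cp_b <- utf8ToInt(b)

# Positions where the code points differ, with both values shown in hex.
pos <- which(cp_a != cp_b)
data.frame(
  position = pos,
  a_hex = sprintf("U+%04X", cp_a[pos]),
  b_hex = sprintf("U+%04X", cp_b[pos])
)
```

Applied to the real word1 and word2, the same comparison would show which code point is hiding behind the "space" that fails the equality test.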

This SO answer provides a downstream workaround. But wouldn't it be great to force a uniform encoding at the read.csv or readr::read_csv step? And generally speaking, how can these different space encodings be better handled in R/RStudio/dplyr?

I'm not sure how you can solve that, but I couldn't reproduce your error, the two words are equal in my computer. — Maël, Mar 08 '22 at 16:50
would this help https://stackoverflow.com/a/69181967/4083743 — user63230, Mar 08 '22 at 16:54
@user63230 unfortunately, this doesn't solve the issue, the UTF-8 encoding doesn't stick for some reason. — Jeff Parker, Mar 08 '22 at 17:07
I also can't replicate the issue with the example you provided. The problem usually is that the data is not imported correctly. You need to know what encoding was used in the file you are reading. That's not something that can be determined from the file itself. Windows usually uses LATIN-1 as the default encoding, but Mac and Linux use UTF-8. If you are working across operating systems you just need to be careful. Those different space characters may have different meanings in specific contexts. Transforming them all to an ASCII space might drop important information — MrFlick, Mar 08 '22 at 18:51
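
The points raised in the comments can be sketched in two steps: declare the file's encoding at read time, then decide explicitly whether to normalize the space characters. This is a sketch under stated assumptions, not a confirmed fix for the files in the question: the temporary file, its contents, and the helper name normalize_spaces are hypothetical, and the file is assumed to be UTF-8 with a no-break space in it. It uses base read.csv; with readr the analogous argument is locale = locale(encoding = "UTF-8").

```r
# Hypothetical CSV written as UTF-8; the no-break space (U+00A0) before
# "LTD." is an assumption standing in for the odd space in the real data.
tmp <- tempfile(fileext = ".csv")
writeLines(enc2utf8(c("word1", "\"KAIFENG CO.,\u00a0LTD.\"")), tmp, useBytes = TRUE)

# Declare the encoding at read time so the strings are marked as UTF-8.
df <- read.csv(tmp, encoding = "UTF-8", stringsAsFactors = FALSE)

# Hypothetical helper: replace every Unicode space separator (category Zs,
# which includes U+00A0) with a plain ASCII space.
normalize_spaces <- function(x) {
  gsub("\\p{Zs}", " ", enc2utf8(x), perl = TRUE)
}

raw_word   <- df$word1
clean_word <- normalize_spaces(raw_word)

raw_word   == "KAIFENG CO., LTD."  # compare before normalizing
clean_word == "KAIFENG CO., LTD."  # compare after normalizing
```

As the last comment warns, collapsing every space variant into an ASCII space discards a distinction that may matter in some contexts, so it is probably safer to apply this only to the join keys and leave the original column untouched.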
