These two strings are part of .csv files. They certainly look the same, but they are not, and this causes all sorts of problems when trying to use dplyr functions like left_join and filter, or even base functions like merge.
Since they render in the same encoding on webpages like SO, here are the files in a gist.
library(tidyverse)
df1 <- read_csv("df1.csv")
df2 <- read_csv("df2.csv")
word1 <- pull(df1, word1)
word2 <- pull(df2, word2)
# word1 <- "KAIFENG PINGMEI NEW CARBON MATERIALS TECHNOLOGY CO., LTD."
# word2 <- "KAIFENG PINGMEI NEW CARBON MATERIALS TECHNOLOGY CO., LTD."
word1 == word2 # FALSE
dput(word1)
"KAIFENG PINGMEI NEW CARBON MATERIALS TECHNOLOGY CO., LTD."
dput(word2)
"KAIFENG PINGMEI NEW CARBON MATERIALS TECHNOLOGY CO., LTD."
Digging deeper, the spaces are encoded differently.
tibble(char = char1) %>%
mutate(diff = char1 == char2,
enc_1 = Encoding(char1),
enc_2 = Encoding(char2))
# A tibble: 57 × 4
char diff enc_1 enc_2
<chr> <lgl> <chr> <chr>
1 K TRUE unknown unknown
2 A TRUE unknown unknown
3 I TRUE unknown unknown
4 F TRUE unknown unknown
5 E TRUE unknown unknown
6 N TRUE unknown unknown
7 G TRUE unknown unknown
8 FALSE UTF-8 unknown
9 P TRUE unknown unknown
10 I TRUE unknown unknown
# … with 47 more rows
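If the character flagged in row 8 is a non-breaking space, comparing the raw code points should confirm it. A small sketch; the values in the comments are my expectation (U+00A0 is code point 160, a plain space U+0020 is 32), not verified output:
utf8ToInt(char1[8])  # expected 160 if word1 uses a non-breaking space (U+00A0)
utf8ToInt(char2[8])  # expected 32 for a plain ASCII space (U+0020)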
This SO answer provides a downstream workaround. But wouldn't it be great to force a uniform encoding at the read.csv or readr::read_csv step? And generally speaking, how can these different space encodings be better handled in R/RStudio/dplyr?
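For illustration, this is the kind of normalization I mean; a minimal sketch that assumes the only mismatch is a non-breaking space (U+00A0), with normalize_spaces being just a made-up helper name rather than an existing function:
library(dplyr)
library(stringr)

# Replace non-breaking spaces with plain spaces in every character column
# right after reading the files.
normalize_spaces <- function(df) {
  df %>%
    mutate(across(where(is.character),
                  ~ str_replace_all(.x, "\u00A0", " ")))
}

df1 <- normalize_spaces(df1)
df2 <- normalize_spaces(df2)
pull(df1, word1) == pull(df2, word2)  # should now be TRUE if the space was the only difference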