readr::read_csv issue: Chinese characters become garbled

Question

I'm trying to import a dataset into RStudio, but the Chinese characters come out garbled. Here is the code:

library(tidyverse)
df <- read_csv("中文,英文\n英文,德文")
df
# A tibble: 1 x 2
  `\xd6\xd0\xce\xc4`            `Ӣ\xce\xc4`
               <chr>                  <chr>
1 "<U+04E2>\xce\xc4" "<U+00B5>\xc2\xce\xc4"

When I use the base function read.csv, it works well, so I guess I am doing something wrong with the encoding. But read_csv has no encoding argument. How can I fix this?

You may check [here](https://stackoverflow.com/questions/22876746/how-to-read-data-in-utf-8-format-in-r) or [here](https://stackoverflow.com/questions/20577764/set-locale-to.-system-default-utf-8). `read_csv` has a `locale` argument. According to the documentation for `locale`: "The locale controls defaults that vary from place to place. The default locale is US-centric (like R), but you can use locale() to create your own locale that controls things like the default time zone, encoding, decimal mark, big mark, and day/month names." — akrun, Oct 29 '17 at 03:24
Also note, `readr` can read alternate encodings via `locale`. However, *all readr functions yield strings encoded in UTF-8* according to [package documentation](https://github.com/tidyverse/readr/blob/master/vignettes/locales.Rmd) — Kevin Arseneau, Oct 29 '17 at 03:27
Thanks for your comments, @akrun and @Kevin Arseneau! I tried what you suggested, but it still does not work: `Sys.setlocale(category = "LC_ALL", locale = "English_United States.1252"); read_csv("a,b\n坏,好"); Sys.setlocale(category = "LC_ALL", locale = "chinese"); read_csv("a,b\n坏,好")` — X.Jun, Oct 29 '17 at 07:02

score 7 · Accepted Answer · answered Oct 29 '17 at 11:21

This happens because the characters are marked as UTF-8, whereas their actual encoding is the system default (which you can get with stringi::stri_enc_get()).
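
The mismatch described above can be reproduced in base R. This is only a minimal sketch: it assumes the platform's iconv supports GBK (the encoding that the bytes `\xd6\xd0\xce\xc4` in the question correspond to), and the variable names are illustrative.

```r
# "中文" written with \u escapes, so R stores it as a UTF-8 string
x <- "\u4e2d\u6587"

# Re-encode to GBK; the result holds GBK bytes (assumes iconv knows "GBK")
gbk <- iconv(x, from = "UTF-8", to = "GBK")
charToRaw(gbk)            # d6 d0 ce c4, the same bytes seen in the question

# Wrongly declare those GBK bytes to be UTF-8, as happened in the question
Encoding(gbk) <- "UTF-8"
validUTF8(gbk)            # FALSE: the bytes are not a valid UTF-8 sequence

# Re-marking as "unknown" tells R to interpret the bytes in the native encoding
Encoding(gbk) <- "unknown"
Encoding(gbk)             # "unknown"
```

Changing the mark never alters the bytes; it only changes how R interprets them when printing or converting.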

So, you can do either:

1) Read data with the correct encoding:

df <- read_csv("中文,英文\n英文,德文", locale = locale(encoding = stringi::stri_enc_get()))
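
When the data is in a file whose encoding is already known, the encoding can also be named explicitly rather than queried from the system. The sketch below assumes a GBK-encoded file; the temporary file and its contents are made up for illustration, and it again assumes iconv supports GBK.

```r
library(readr)

# Build a small GBK-encoded CSV file (illustrative data only)
csv_utf8 <- "\u4e2d\u6587,\u82f1\u6587\n\u82f1\u6587,\u5fb7\u6587\n"
csv_gbk  <- iconv(csv_utf8, from = "UTF-8", to = "GBK", toRaw = TRUE)[[1]]
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(csv_gbk, con)
close(con)

# Tell readr which encoding the file uses; readr converts the text to UTF-8
df <- read_csv(tmp, locale = locale(encoding = "GBK"))
names(df)
df
```

If the encoding is not known in advance, `readr::guess_encoding(tmp)` returns candidate encodings with confidence scores, which can serve as a starting point.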

2) Read data with the incorrect encoding and mark them with the correct encoding later (note that this does not always work):

df <- read_csv("中文,英文\n英文,德文")
df <- dplyr::mutate_all(df, `Encoding<-`, value = "unknown")
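
Note that `mutate_all` only touches the column values, not the column names. As a hedged sketch, a small base R helper (the name `remark_unknown` is hypothetical) could re-mark both; like option 2 itself, it only helps when the original bytes survived the read intact.

```r
# Hypothetical helper: re-mark character columns and column names as "unknown"
# so that R interprets their bytes in the native (system) encoding.
remark_unknown <- function(df) {
  for (i in seq_along(df)) {
    if (is.character(df[[i]])) {
      Encoding(df[[i]]) <- "unknown"
    }
  }
  nm <- names(df)
  Encoding(nm) <- "unknown"
  names(df) <- nm
  df
}

# Usage, assuming df was read with the wrong encoding mark:
# df <- remark_unknown(df)
```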

Thanks very much for your suggestion! It works pretty well! — X.Jun, Oct 31 '17 at 02:41
