Chinese Character Encoding in R Studio

Question

I am currently working with large CSVs of Chinese medical records in R Studio but am having trouble processing Han Chinese characters. In particular, I can "view" Chinese characters in table form (i.e. using R Studio's built-in data viewer to see the entire dataset), but I cannot render them in the output of code chunks in an R Markdown document, so I am unable to "process" or "interact" with them.

I've already tried setting the system locale to Simplified Chinese via Sys.setlocale(category = "LC_CTYPE", locale = "chs"), reading in the .csv with UTF-8 encoding via read.csv('filepath/filename.csv', encoding = "UTF-8", stringsAsFactors = FALSE), and even changing the OS system language (Windows 10), but all to no avail.

Any thoughts you may have on what seems like a "dual-treatment" of Chinese characters in R Studio are greatly appreciated!

Donald Seinen · Answer 1 · 2021-03-23T08:03:24.907

Today I came across a similar problem: knitting a markdown document with an interactive plot that includes some Chinese characters. The code runs fine in a normal R session, but the output was different when knitting the markdown document. There, a data wrangling step turned the characters into <U+xxxx> escapes, which broke the subsequent plotting and therefore the knit.

    library(dplyr)    # filter() and %>%
    library(stringr)  # str_detect()
    df %>%
      filter(str_detect(variable, "皖"))

For me, adding the following line in the markdown document worked:

Sys.setlocale(category = "LC_CTYPE", "Chinese (Simplified)_China.936")
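A minimal sketch of applying the same fix a little more defensively, since locale names differ between platforms. The helper name set_ctype_locale and the fallback name "zh_CN.UTF-8" are assumptions for illustration only; the line above is what actually worked for me on Windows. It relies on Sys.setlocale() returning an empty string (with a warning) when a locale name is not accepted.

```r
# Hypothetical helper: try each candidate locale name in turn and
# return the first one the system accepts, or NA if none work.
set_ctype_locale <- function(candidates) {
  for (loc in candidates) {
    # A failed attempt returns "" with a warning and leaves the locale unchanged.
    res <- suppressWarnings(Sys.setlocale("LC_CTYPE", loc))
    if (length(res) == 1 && nzchar(res)) {
      return(res)
    }
  }
  NA_character_
}

# The Windows name comes from the answer; "zh_CN.UTF-8" is an assumed
# name for Linux/macOS and may not be installed on a given machine.
chosen <- set_ctype_locale(c("Chinese (Simplified)_China.936", "zh_CN.UTF-8"))

if (is.na(chosen)) {
  message("No Chinese locale available; LC_CTYPE is still: ",
          Sys.getlocale("LC_CTYPE"))
} else {
  message("LC_CTYPE set to: ", chosen)
}
```

Placing this in the first chunk of the R Markdown document means the locale is set inside the process that knits, which may not share the settings of the interactive session.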

Finding all available locales on a machine is not easy; it is a deep rabbit hole. As to why an interactive R session and a knit behave differently, I suspect it is due to some default setting that applies when the script is evaluated, but that is speculation.
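One way to test that suspicion is to print the same diagnostics in the console and in a knitted chunk, then compare them. This is only a sketch: the function names encoding_report and list_unix_locales are made up for illustration, and the locale listing only works on Unix-alikes that ship a `locale` command (I am not aware of an equivalent one-liner on Windows). Writing the test character as a \u escape keeps the example independent of the source file's encoding.

```r
# Hypothetical helper: collect settings that may differ between a
# console session and a knit.
encoding_report <- function() {
  list(
    ctype     = Sys.getlocale("LC_CTYPE"),
    utf8      = isTRUE(l10n_info()[["UTF-8"]]),
    r_version = R.version.string
  )
}

# Hypothetical helper: list installed locales on Unix-alikes only;
# returns an empty vector elsewhere or if the command is missing.
list_unix_locales <- function() {
  if (.Platform$OS.type != "unix" || !nzchar(Sys.which("locale"))) {
    return(character(0))
  }
  tryCatch(
    suppressWarnings(system2("locale", "-a", stdout = TRUE, stderr = FALSE)),
    error = function(e) character(0)
  )
}

x <- "\u7696"  # the same character used in the filter example above

report <- encoding_report()
str(report)

# How R has marked the string, and whether a fixed-string match finds it.
enc <- Encoding(x)
hits <- grepl(x, c("\u5b89\u5fbd (\u7696)", "abc"), fixed = TRUE)
print(enc)
print(hits)

# Any installed locale whose name starts with "zh" (Unix-alikes only).
print(grep("^zh", list_unix_locales(), value = TRUE))
```

If the ctype or utf8 entries differ between the two runs, that points to the locale of the knitting process rather than to the data itself.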

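A little digging into what might be happening here: rendering goes through rmarkdown::render, which calls knitr::knit. When knitr looks up its evaluation environment it uses %n%, which in knitr's source appears to be an internal "fall back to the right-hand side if the left is NULL" helper (not the network-attribute operator of the same name from the network package), with the primitive function globalenv() as the fallback. We can inspect this function following this advice, finding do_globalenv in R's source code.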
