0

We are cleaning some marketing data in Traditional Chinese. We found that R can read UTF-8 Traditional Chinese variable names without any problem. However, we cannot get valid UTF-8 output. For example:

If we run: unique(rframe$性別)

This is what we got: [1] "\u5973" "\u7537"

Here 性別 is "gender," \u5973 means female (女), and \u7537 means male (男).

The most interesting thing is that R on Linux generates valid UTF-8 Chinese output from the same UTF-8 CSV file. Why can the same RStudio that outputs Chinese in UTF-8 successfully on Linux not produce valid UTF-8 Chinese output on the Mac?

This very troublesome issue has been around for a long while. In fact, with an older RStudio version, we could get valid UTF-8 output. Can anyone help us?

Much obliged.

Chandler

Phil

3 Answers

1

This issue comes from a bug in the R 4.0.4 source code: UTF-8 characters were not displayed correctly on either Windows or Mac. It is fixed in version 4.0.5.
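To see which side of that fix a given installation is on, a minimal sketch along these lines may help. It only compares the running version against 4.0.5; the variable names `fixed_in` and `current` are made up for illustration.

```r
# Compare the running R version against 4.0.5, the release this answer
# says contains the fix. Variable names here are illustrative only.
fixed_in <- package_version("4.0.5")
current <- getRversion()

if (current < fixed_in) {
  message("R ", as.character(current), " is older than ",
          as.character(fixed_in), "; upgrading may be worth trying.")
} else {
  message("R ", as.character(current), " is at or above ",
          as.character(fixed_in), ".")
}
```

If the version is already 4.0.5 or later and the escapes still appear, the cause is presumably something else.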

0

The error may be in how the data is imported. How did you import your data?

I tried importing some data with Chinese characters, explicitly setting encoding="UTF-8", and I don't have any issues.

So my first suggestion is to try this:

data <- read.csv("mydata.csv", encoding = "UTF-8", stringsAsFactors = FALSE)
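As a self-contained way to try that call without the original file, the sketch below writes a tiny UTF-8 CSV to a temporary file and reads it back. The column name `gender`, the two values, and the variable names are assumptions chosen to mirror the question, not the real data.

```r
# Hypothetical sample data, written as Unicode escapes so the snippet does
# not depend on the encoding of the script file itself.
lines <- c("gender", "\u5973", "\u7537")

# Write the UTF-8 bytes unchanged to a temporary CSV file.
path <- tempfile(fileext = ".csv")
writeLines(enc2utf8(lines), path, useBytes = TRUE)

# Read it back, telling R to mark the strings it reads as UTF-8.
df <- read.csv(path, encoding = "UTF-8", stringsAsFactors = FALSE)

# Inspect the marked encoding and the values.
Encoding(df$gender)
unique(df$gender)
```

If `Encoding()` reports "UTF-8" here but the console still shows escapes, the import is probably not the part that is failing.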

An additional approach is to convert your variables to character explicitly. According to another answer, this way you get the Chinese characters instead of the Unicode escapes:

as.character(unique(rframe$性別))

If you provide an excerpt from the data, I can check and possibly confirm this.
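To narrow down whether the data or the display is at fault, a rough diagnostic like the following could be run on both the Mac and the Linux machine. It uses only base R functions; the vector `x` simply stands in for the result of `unique(rframe$性別)`.

```r
# Stand-in for the two values from the question, as Unicode escapes.
x <- c("\u5973", "\u7537")

# How R has marked each string: "UTF-8", "latin1", "bytes" or "unknown".
Encoding(x)

# Raw bytes of the first value; valid UTF-8 for U+5973 is e5 a5 b3.
charToRaw(x[1])

# Whether the current session believes it is running in a UTF-8 locale.
l10n_info()[["UTF-8"]]
Sys.getlocale("LC_CTYPE")

# print() may escape characters it treats as non-printable in the current
# locale; cat() typically writes the strings without that escaping step.
print(x)
cat(x, sep = "\n")
```

If the bytes and the marked encoding are identical on both machines and only `print()` differs, the strings themselves are fine and the difference lies in how that R build or locale displays them.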

Albin
  • Yes, we added the encoding options you mention when we imported the CSV file. Thanks also for pointing to the information from around six years ago about "getting Chinese characters instead of the Unicode." However, nowadays, it is unusual to use the Chinese character rather than UTF-8. The most interesting thing is that R on Linux generates valid UTF-8 Chinese output from the same UTF-8 CSV file. Why can the same RStudio that outputs Chinese in UTF-8 successfully on Linux not produce valid UTF-8 Chinese output on the Mac? – Chandler C. Chu Mar 27 '21 at 20:00
  • I can't answer this question, as I'm using R on macOS and my suggestion generated the correct output. If you can provide some example data, I can investigate further. – Albin Mar 27 '21 at 20:05
  • Okay. Here is a sample file. Much appreciated. https://www.dropbox.com/s/238aet54wzd53zf/rframe10.csv?dl=0 – Chandler C. Chu Mar 27 '21 at 21:33
  • Thank you for providing the data. I ran the same steps on your data and got the correct output. While investigating further, I stumbled on another possible reason for this error. You can go to RStudio >> Preferences and select Code. From there, select the Saving tab and set UTF-8 as your default text encoding. If this does not solve it, I hope your direct message to R will help. – Albin Mar 28 '21 at 09:21
  • Thanks for your tip. UTF-8 has been the default text encoding since we installed R on our Mac. – Chandler C. Chu Mar 28 '21 at 11:19
0

After some trial and error, we found that this issue probably comes from how the R application for Mac is built.

We downloaded the R source from Git and compiled the application from source with Apple clang version 12.0.0 (clang-1200.0.32.29, Target: x86_64-apple-darwin19.6.0). It works fine, and the issue no longer appears. We reported our findings to the R project today. We hope to see a quick response.
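For anyone comparing a self-compiled build against the downloaded one, it may help to record what each build reports about itself. This is only a sketch using base R; the list name `build_info` is hypothetical and the fields chosen are a guess at what is relevant.

```r
# Collect a few facts about the running R build and its locale.
# `build_info` is an illustrative name, not part of any API.
build_info <- list(
  version  = R.version.string,          # full version string of this build
  platform = R.version$platform,        # platform triple R was built for
  locale   = Sys.getlocale("LC_CTYPE"), # character-type locale in effect
  utf8     = l10n_info()[["UTF-8"]],    # TRUE if R thinks the locale is UTF-8
  software = extSoftVersion()           # versions of bundled/external libraries
)

str(build_info)
```

Running this in both builds and comparing the output side by side might show where they differ.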

The following message is the report we sent to R.

To: Bug-Report-Request bug-report-request@r-project.org

Hi,

I am more of a system programmer that helps my friend (Chandler) use R to process Data. He has quite some trouble getting Chinese / Unicode output on the terminal. However, that only happens on Mac. I can't reproduce it on Linux.

I think something might be wrong with the Mac version of R. I recompiled R from the source code on GitHub, and I can't reproduce this issue. With the build downloaded from the website, it can be reproduced with a 100% failure rate.

The details live in https://www.facebook.com/groups/RnRStudio/permalink/4555694011125386/

I think that's because the toolchain used to compile R for Mac could be out of date.

If you can create a bug on Bugzilla and enable me to comment there, I won't need a Bugzilla account. Or if any of you can sponsor on this issue, that's even better.

Or I'll need a Bugzilla account.

Thank you!