4

I am trying to read a CSV file generated by SQL Server Management Studio and saved as UTF-8 (I chose that option when saving it) into R version 3.0.1 (x64) with read.csv2(). I can't get R to display special characters correctly.

If I set fileEncoding="UTF-8-BOM", the import stops at the line containing a ÿ. However, when I open the file in Notepad++, the ÿ is displayed correctly with UTF-8 encoding. I have also tried without setting fileEncoding, but then the special characters aren't displayed correctly (of course).

The CSV file is available here: https://www.dropbox.com/s/7y47i826ikq8ahi/Data.csv

How do I read the csv file and display the text in the right encoding?

Thanks!!

Mace

3 Answers

5

I found the answer myself. The problem was the conversion from UTF-8 to the system locale (the default encoding in R) that fileEncoding triggers. As I use RStudio, I just changed its default encoding to UTF-8 and removed fileEncoding="UTF-8-BOM" from the read.csv2() call. Then the entire CSV file was read and RStudio displayed all characters correctly.
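For readers not using RStudio, a related approach is to skip fileEncoding and pass encoding = "UTF-8" instead, which marks the imported strings as UTF-8 rather than converting them to the system locale. This is a different technique from changing the RStudio setting, and the sketch below is only an illustration: the temporary file and its contents are made up to stand in for Data.csv, and it contains no byte-order mark.

```r
# Build a small semicolon-separated UTF-8 file (hypothetical stand-in for Data.csv).
path <- tempfile(fileext = ".csv")
con <- file(path, open = "wb")
writeLines(enc2utf8(c("id;name", "1;\u00ff", "2;\u00e4\u00f6\u00fc")), con, useBytes = TRUE)
close(con)

# No fileEncoding here: the bytes are read as-is and the resulting strings
# are declared to be UTF-8 via the encoding argument.
df <- read.csv2(path, encoding = "UTF-8", stringsAsFactors = FALSE)

# Inspect how R has marked the imported strings.
print(Encoding(df$name))
print(nrow(df))
```

If the real file starts with a byte-order mark, the first column name may pick up stray characters with this approach, so it is worth checking names(df) after the import.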

Mace
2

For those still stuck with this issue: my scripts were able to recognise umlauts and the eszett (ä, ö, ü, ß) once I added a line at the top of the script that changes the default character encoding: options(encoding = "UTF-8"). (In my case, setting the option in RStudio directly didn't affect the encoding!)
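A minimal sketch of that line in context, assuming you also want to check the setting and put the previous value back when the script is done (the restore step is an addition, not part of the answer above):

```r
# options() returns the previous values, so they can be restored later.
old <- options(encoding = "UTF-8")

# Confirm the default encoding now used for file connections and source().
current <- getOption("encoding")
print(current)

# ... read files or source other scripts here ...

# Restore whatever was set before the script ran.
options(old)
```

Restoring the old value keeps the change from leaking into other scripts that run in the same R session.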

David
0

In my case, I had this issue with R inside a Docker container (Debian and R). When I ran locale in the container, all variables appeared empty. I solved the problem by adding this to the Dockerfile:

ENV LANG=en_US.UTF-8
ENV LC_CTYPE=en_US.UTF-8
ENV LC_NUMERIC=es_AR.UTF-8
ENV LC_TIME=es_AR.UTF-8
ENV LC_COLLATE=en_US.UTF-8
ENV LC_MONETARY=es_AR.UTF-8
ENV LC_MESSAGES=en_US.UTF-8
ENV LC_PAPER=es_AR.UTF-8
ENV LC_NAME=es_AR.UTF-8
ENV LC_ADDRESS=es_AR.UTF-8
ENV LC_TELEPHONE=es_AR.UTF-8
ENV LC_MEASUREMENT=es_AR.UTF-8
ENV LC_IDENTIFICATION=es_AR.UTF-8
ENV LC_ALL=C.UTF-8

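I have es_AR in some values, but I think en_US or another locale should work. Note that LC_ALL, when set, takes precedence over LANG and the other LC_* variables, so the last line is the one that effectively determines the locale.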
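To check from inside R whether the container's locale settings were picked up, something like the following sketch can help. It only uses base R functions; what it prints depends entirely on the environment it runs in.

```r
# Ask R which character-type locale it is running under.
ctype <- Sys.getlocale("LC_CTYPE")

# l10n_info() reports, among other things, whether the locale is UTF-8.
info <- l10n_info()

cat("LC_CTYPE:", ctype, "\n")
cat("UTF-8 locale:", info[["UTF-8"]], "\n")
```

If the second line prints FALSE, R is not running in a UTF-8 locale and non-ASCII characters may still be displayed incorrectly.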

Emeeus