22

I'm trying to read the following UTF-8 encoded file in R, but whenever I read it, the unicode characters are not encoded correctly:

[Screenshot of the incorrectly displayed characters]

The script I'm using to process the file is as follows:

defaultEncoding <- "UTF8"
detalheVotacaoMunicipioZonaTypes <- c("character", "character", "factor", "factor", "factor", "factor", "factor",
                                                     "factor", "factor", "factor", "factor", "factor", "numeric", 
                                                     "numeric", "numeric", "numeric", "numeric", "numeric",
                                                     "numeric", "numeric", "numeric", "numeric", "numeric", 
                                                     "numeric", "character", "character")

readDetalheVotacaoMunicipioZona <- function( fileName ) {
  fileConnection = file(fileName,encoding=defaultEncoding)
  contents <- readChar(fileConnection, file.info(fileName)$size)  
  close(fileConnection)
  contents <- gsub('"', "", contents)

  columnNames <- c("data_geracao", "hora_geracao", "ano_eleicao", "num_turno", "descricao_eleicao", "sigla_uf", "sigla_ue",
                   "codigo_municipio", "nome_municipio", "numero_zona", "codigo_cargo", "descricao_cargo", "qtd_aptos", 
                   "qtd_secoes", "qtd_secoes_agregadas", "qtd_aptos_tot", "qtd_secoes_tot", "qtd_comparecimento",
                   "qtd_abstencoes", "qtd_votos_nominais", "qtd_votos_brancos", "qtd_votos_nulos", "qtd_votos_legenda", 
                   "qtd_votos_anulados", "data_ult_totalizacao", "hora_ult_totalizacao")

  read.csv(text=contents, 
           colClasses=detalheVotacaoMunicipioZonaTypes,
           sep=";", 
           col.names=columnNames, 
           fileEncoding=defaultEncoding,
           header=FALSE)
}

I open the file with the UTF-8 encoding, strip all the quotes (even the numbers are quoted, so I need to clean them up), and then feed the contents to read.csv. The file is read and parsed correctly, but the encoding I pass in seems to be ignored.

What should I do to make it use UTF-8 to read this file?

I'm using RStudio on OSX if it makes any difference.

smci
Maurício Linhares
  • I don’t know how text is stored internally in R, but in any case it seems like you’re attempting to decode UTF-8 *twice* (but R should disregard that according to the documentation). – Konrad Rudolph Apr 27 '14 at 15:02
  • Are you sure the file is properly encoded? Are you using RStudio? It could be that it is read correctly but not displayed correctly in their interface (I can't find the issue now, maybe it has been closed). – ilir Apr 27 '14 at 15:47
  • Inside r-studio it doesn't work, if I do it in a console session it works. That's weird. – Maurício Linhares Apr 27 '14 at 16:43
  • 2
    If the only problem is RStudio, go to RStudio->Preferences:General, tell us what 'Default text encoding:' is set to, click 'Change' and try UTF-8 or ISO8859-1 ('latin1'). Let us know which one worked! – smci May 07 '14 at 20:21
  • Your .csv file on github looks like correctly-encoded Windows-1252 to me. You say the problem only happens inside RStudio. So let's try setting both the locale and default character encodings (try CP1252, UTF-8, ISO8859-1 in that order). See my answer below. – smci May 07 '14 at 20:28
  • Tagged 'RStudio' and retitled, since you say this issue is caused by RStudio, not R itself. (Revert that if that's not the case) – smci May 07 '14 at 20:42
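
One way to sidestep the double-decoding concern raised in the comments is to let read.csv handle the quotes and the encoding in a single pass, instead of going through readChar and gsub first. The following is only a sketch under assumptions: the sample record and the four column names are a hypothetical subset of the real file, every column is read as character first (read.csv only strips quotes from columns read as character), and `encoding = "UTF-8"` merely marks the strings as UTF-8 rather than re-encoding them.

```r
# Hypothetical sample: one quoted, semicolon-separated record stored as UTF-8 bytes.
sample_path <- tempfile(fileext = ".txt")
sample_text <- "\"25/04/2014\";\"2012\";\"ELEI\u00c7\u00c3O MUNICIPAL 2012\";\"17157\"\n"
writeBin(charToRaw(sample_text), sample_path)

# Read every column as character so read.csv removes the quotes itself,
# and mark the resulting strings as UTF-8.
votes <- read.csv(sample_path, sep = ";", header = FALSE, quote = "\"",
                  col.names = c("data_geracao", "ano_eleicao",
                                "descricao_eleicao", "qtd_aptos"),
                  colClasses = "character", encoding = "UTF-8")

# Convert the columns that should not stay character.
votes$ano_eleicao <- factor(votes$ano_eleicao)
votes$qtd_aptos <- as.numeric(votes$qtd_aptos)
str(votes)
```

If the file turns out not to be UTF-8 (one comment above suspects Windows-1252), `fileEncoding` is the argument that asks read.csv to re-encode the input.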

5 Answers

19

This problem is caused by the wrong locale being set, whether inside RStudio or command-line R:

  1. If the problem only happens in RStudio and not in command-line R, go to RStudio->Preferences:General, tell us what 'Default text encoding:' is set to, click 'Change' and try Windows-1252, UTF-8 or ISO8859-1 ('latin1') (or else 'Ask' if you always want to be prompted). Screenshot attached at bottom. Let us know which one worked!

  2. If the problem also happens in command-line R, do the following:

Run locale -m in a terminal on your Mac and tell us whether it supports CP1252 or else ISO8859-1 ('latin1'). (locale -m lists the supported character maps; locale -a lists the available locales.) Dump the list of supported locales if you need to. (You might as well tell us your version of macOS while you're at it.)

For both of those locales, try to change to that locale:

# First try Windows CP1252, although that's almost surely not supported on Mac:
Sys.setlocale("LC_ALL", "pt_PT.1252") # Don't omit the "LC_ALL" first argument, or the call will fail.
Sys.setlocale("LC_ALL", "pt_PT.CP1252") # the name might need to be 'CP1252'

# Next try ISO8859-1 ('latin1'); this works for me:
Sys.setlocale("LC_ALL", "pt_PT.ISO8859-1")

# Try "pt_PT.UTF-8" too...

# In your program, make sure Sys.setlocale() worked: put this assertion in your code before calling read.csv:
stopifnot(Sys.getlocale('LC_CTYPE') == "pt_PT.ISO8859-1")

That should work. Strictly speaking, the Sys.setlocale() call belongs in your ~/.Rprofile so that it runs at startup, rather than in your R session or source code. Be aware that Sys.setlocale() can fail, so assert Sys.getlocale() in your setup code early and often, as I do. (Really, read.csv should check whether the encoding it uses is compatible with the locale, and warn or error if not.)
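
Because Sys.setlocale() returns an empty string (with a warning) when the OS rejects a locale, the trial and error above can be wrapped in a small helper. This is only a sketch: `set_first_locale` is a hypothetical name, and the candidate list is an assumption that you should adapt to whatever `locale -a` reports on your machine.

```r
# Hypothetical helper: try each candidate locale in turn and keep the first
# one the operating system accepts.
set_first_locale <- function(candidates) {
  for (candidate in candidates) {
    # Sys.setlocale() returns "" (and warns) when the request cannot be honored.
    result <- suppressWarnings(Sys.setlocale("LC_ALL", candidate))
    if (nzchar(result)) {
      return(candidate)
    }
  }
  NA_character_
}

# Assumed candidate list; "C" is included only as a last resort.
chosen <- set_first_locale(c("pt_BR.UTF-8", "pt_PT.UTF-8", "pt_PT.ISO8859-1", "C"))
stopifnot(!is.na(chosen))
print(Sys.getlocale("LC_CTYPE"))
```

Putting the preferred locale first means the helper only falls back to the alternatives when the OS refuses it.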

Let us know which fix worked! I'm trying to document this more generally so we can figure out the correct enhancement.

Screenshot of the RStudio Preferences 'Change default text encoding' menu: [image]
smci
  • 1
    I notice you're in Brazil, so use `pt_BR.` instead of `pt_PT.`. Shouldn't make any difference for this issue though. – smci May 07 '14 at 20:37
  • 1
    +1. I think this is quite relevant for Portuguese users of both R and RStudio. – Paulo E. Cardoso May 07 '14 at 23:10
  • @PauloCardoso: it's relevant for *all* international users of R and RStudio using foreign character sets or Unicode. – smci May 08 '14 at 09:46
  • 1
    In RStudio 1.0.136, this setting has moved to `Options > Code > Default text encoding`, but the data is still shown in the wrong encoding: Latin-1, not UTF-8. – hhh Feb 06 '17 at 07:14
  • 2
    In RStudio 1.0.143 I couldn't find RStudio->Preferences:General. And at Options > Code there is no "Default text encoding" option – Homero Esmeraldo Jun 20 '17 at 18:51
  • Many thanks! So what is UTF-8 for, then? I thought it was something like Esperanto for encodings. Yet although my Postgres DB and my RStudio are both set to UTF-8, these DB connections (and most web content) work with e.g. Windows-1252, not UTF-8. – Peter.k Jan 24 '18 at 22:00
  • 1
    @Peter.k: UTF-8/16 are the two most common encoding schemes for Unicode; see [*What is Unicode, UTF-8, UTF-16? *](https://stackoverflow.com/questions/2241348/what-is-unicode-utf-8-utf-16). For supporting foreign languages, and other fonts/ symbols (/ emoticons). But if you find a bug in RStudio IDE itself, please post it at https://github.com/rstudio/rstudio/issues . At least read through their existing issues, upvote, subscribe etc. RStudio frequently has bugs, so be community-minded and report them... – smci Jan 24 '18 at 22:06
  • 1
    You can find the preferences under global options -> code -> saving in the newer versions of Rstudio – Nina van Bruggen Jul 08 '21 at 14:58
5

It works fine for me.

Did you try to change/reset locale?

In my case it works with:

Sys.setlocale(category = "LC_ALL", locale = "Portuguese_Portugal.1252")

d <- read.table(text=readClipboard(), header=TRUE, sep = ';')

head(d)

1  25/04/2014  22:29:30  2012  1 ELEIÇÃO MUNICIPAL 2012 PB  20419    20419      ITAPORANGA  33  13 VEREADOR 17157
2  25/04/2014  22:29:30  2012  1 ELEIÇÃO MUNICIPAL 2012 PB  20770    20770           MALTA  51  11 PREFEITO  4677
3  25/04/2014  22:29:30  2012  1 ELEIÇÃO MUNICIPAL 2012 PB  21091    21091     OLHO D'ÁGUA  32  13 VEREADOR  6653
4  25/04/2014  22:29:30  2012  1 ELEIÇÃO MUNICIPAL 2012 PB  21113    21113        OLIVEDOS  23  13 VEREADOR  3243
...
Paulo E. Cardoso
  • 1
    This is what I get when I try that: `OS reports request to set locale to "Portuguese_Portugal.1252" cannot be honored`. I'm using a mac if it makes any difference. – Maurício Linhares Apr 27 '14 at 16:43
  • @MaurícioLinhares hard to say, especially on a Mac. Did you see [this topic](http://stackoverflow.com/q/5345132/640783)? – Paulo E. Cardoso Apr 27 '14 at 16:48
  • @MaurícioLinhares: make sure you typed `Sys.setlocale("LC_ALL", "Portuguese_Portugal.1252")`. If you omit the `"LC_ALL",` first argument, it will fail. – smci May 07 '14 at 20:07
  • @MaurícioLinhares: make sure you typed Sys.setlocale("LC_ALL", "Portuguese_Portugal.1252"). If you omit the "LC_ALL", first argument, it will fail. Second, it might be `.CP1252` instead of `1252`. Third, `CP1252` might not be supported on Mac anyway; run `locale -m` to see what your machine supports (what version of MacOS are you on, anyway?). – smci May 07 '14 at 20:32
  • 1
    @Paulo: instead of `Portuguese_Portugal.`, just say `pt_PT.`. Or `pt_BR.` – smci May 07 '14 at 20:35
  • @smci "Portuguese_Portugal" is what I get with sessionInfo but ok. nice to know that both cases will work! – Paulo E. Cardoso May 07 '14 at 22:58
2

I had the same problem with the Portuguese locale in R (macOS 10.12.3). I tried the suggestions in the thread above and none of them worked. Then I found this webpage: https://docs.moodle.org/dev/Table_of_locales and just tried Sys.setlocale(category = "LC_ALL", locale = "pt_PT.UTF-8"), and it works.

  • This worked in my case also `Sys.setlocale(category = "LC_ALL", locale = "no_NO.UTF-8")`. Before this `Sys.getlocale()` returned only `[1] "C"` and after it output `[1] "no_NO.UTF-8/no_NO.UTF-8/no_NO.UTF-8/C/no_NO.UTF-8/C"`. Test with your local letters something like this `read.table(text='"å", "æ", "ø", "ä"', sep=",")` – Avec Aug 04 '17 at 09:46
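
To see what the session ended up with after a call like the ones above, base R can report whether the current locale is UTF-8 and how a test string is stored. This is a minimal sketch: the sample letters are written as Unicode escapes so the snippet does not depend on the encoding of the script file, and the variable names are illustrative only.

```r
# Report the active character-type locale and whether R considers it UTF-8.
print(Sys.getlocale("LC_CTYPE"))
is_utf8_session <- isTRUE(l10n_info()[["UTF-8"]])
print(is_utf8_session)

# Three Scandinavian letters written with Unicode escapes, so R marks the
# string as UTF-8 regardless of the session locale.
letters_sample <- "\u00e5\u00e6\u00f8"
print(Encoding(letters_sample))                # declared encoding of the string
print(nchar(letters_sample, type = "chars"))   # number of characters
print(nchar(letters_sample, type = "bytes"))   # number of bytes used to store it
```

A byte count larger than the character count is what you would expect for accented letters stored as UTF-8.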
0

You should try the readr functions, such as read_csv() or read_fwf() (note the underscore instead of the dot). They assume UTF-8 by default and accept a locale = locale(encoding = ...) argument for other encodings; readr::guess_encoding() can suggest a likely encoding for a file. These readr functions also back the "Import Dataset" feature in the RStudio GUI.
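
As a rough illustration of the readr route, the sketch below writes a tiny Latin-1 sample file and reads it back with an explicit encoding. The sample record and the three column names are hypothetical, and the snippet assumes the readr package is installed.

```r
library(readr)

# Hypothetical sample: one quoted, semicolon-separated record stored as Latin-1 bytes.
sample_path <- tempfile(fileext = ".txt")
sample_text <- "\"2012\";\"ELEI\u00c7\u00c3O MUNICIPAL 2012\";\"PB\"\n"
latin1_bytes <- iconv(sample_text, from = "UTF-8", to = "latin1", toRaw = TRUE)[[1]]
writeBin(latin1_bytes, sample_path)

# Tell readr which encoding the file uses; it converts the text to UTF-8 on input.
votes <- read_delim(
  sample_path,
  delim = ";",
  col_names = c("ano_eleicao", "descricao_eleicao", "sigla_uf"),
  col_types = cols(.default = col_character()),
  locale = locale(encoding = "latin1")
)
print(votes$descricao_eleicao)
```

For a file that really is UTF-8, the `locale` argument can be left at its default.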

Elio Diaz
0

If you are on a Mac, open Terminal and run this command:

defaults write org.R-project.R force.LANG en_US.UTF-8

Then restart R. I had the same problem, so I hope it works for you.