
I'm doing some text mining involving Portuguese text. Some of my custom text mining functions also have other special characters in them.

I'm no expert on this topic. When a lot of my characters started displaying incorrectly, I assumed I needed to change the file encoding. I tried

  • ISO-8859-1
  • ISO-8859-7
  • UTF-8
  • WINDOWS-1252

None of them improved the display of characters. Do I need a different encoding or am I going about this all wrong?

For example, when I try to read this list of stopwords from GitHub:

stop_words <- read.table("https://gist.githubusercontent.com/alopes/5358189/raw/2107d809cca6b83ce3d8e04dbd9463283025284f/stopwords.txt") 

They come out like this:

tail(stop_words, 17)
206    tivéramos
207         tenha
208      tenhamos
209        tenham
210       tivesse
211   tivéssemos
212      tivessem
213         tiver
214      tivermos
215       tiverem
216         terei
217         terá
218       teremos
219        terão
220         teria
221     teríamos
222        teriam

I've also tried it with `stringsAsFactors = FALSE`.

I don't speak Portuguese, but my instinct tells me that the Euro and copyright symbols are not in their alphabet. Also, it seems to be changing some accented lowercase e's to uppercase differently accented A's.

In case it's helpful:

Sys.getlocale()

[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

I also tried changing the locale, `stri_encode(stop_words$V1, "", "UTF-8")`, and `tail(enc2native(as.vector(stop_words[,1])), 17)`.
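In case it helps with diagnosis, this sketch shows how I inspected what R thinks the encoding of the strings is (it assumes `stop_words` is the data frame read in above; the indices are just examples):

```r
# What encoding has R marked on each string? ("unknown", "latin1", or "UTF-8")
head(Encoding(as.character(stop_words$V1)))

# The raw bytes of one of the garbled entries, to see what was actually stored
charToRaw(as.character(stop_words$V1)[206])
```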

Hack-R
  • I don't think the problem is with the Portuguese alphabet. When I get the stop_words from GitHub with your code above, I can see the characters properly formatted. How are you changing the file encoding? – Oriol Mirosa Jul 27 '17 at 18:57
  • @OriolMirosa I had the problem before changing encoding from my system default, which is ISO-8859-1. I tried changing it using RStudio (Reopen with encoding) then repulling the data. I also tried changing it with the `stringi` package. I think that the answer below is correct that it's being double-encoded somehow, but I don't know why or how to fix it. – Hack-R Jul 27 '17 at 18:59
  • Have you tried `enc2utf8(as.vector(stop_words[,1]))` or `enc2native(as.vector(stop_words[,1]))` – Oriol Mirosa Jul 27 '17 at 19:02
  • @OriolMirosa I had not tried that, thanks. I just tried it now after reading your comment, but the problem is still there. – Hack-R Jul 27 '17 at 19:06
  • Hmm... What system are you in? Do you use RStudio? What font are you using for your R terminal? Can you see tildes and other latin characters in your terminal? (if your keyboard is in English, press alt+e and then e to get 'é') – Oriol Mirosa Jul 27 '17 at 19:36
  • @OriolMirosa It's Windows 7, R 3.4.1, and RStudio. I don't seem to be able to get the accented e that way. Other Latin and accented characters appear normally in RStudio though. – Hack-R Jul 27 '17 at 19:52
  • When you say that other Latin and accented characters appear normally in RStudio, do you mean also in the console? Or are they never appearing properly *only* on the console? Have you tried using command-line R and seeing if the problem also happens there or only in RStudio? What's your setting in RStudio Preferences > Code > Saving > Default text encoding? – Oriol Mirosa Jul 27 '17 at 20:07
  • @OriolMirosa Good questions. I mean that `àáâãäåçèéêëìíîïðñòóôõöùúûüý` can show up in the code window and in the console. However there are some that can't, like the ones in the question and `šž`. I tried R CLI and Microsoft's R on the command line and had the same problem. – Hack-R Jul 27 '17 at 20:12
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/150349/discussion-between-oriol-mirosa-and-hack-r). – Oriol Mirosa Jul 27 '17 at 20:24

2 Answers


You seem to be double-encoding to UTF-8.

Here is a chart of the characters in UTF-8: http://www.i18nqa.com/debug/utf8-debug.html.
Now look at the "Actual" column.

As you can see, the printed characters seem to represent the actual (garbled) value instead of the encoded value.

A temporary fix would be to decode one layer of UTF-8.
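A sketch of one way to undo a single layer of double encoding in R, assuming the garbled strings are currently stored as UTF-8 (the classic trick is to convert them back to Latin-1, which recovers the original UTF-8 bytes, and then declare those bytes as UTF-8):

```r
# "tivÃ©ramos" is UTF-8 bytes that were misread as Latin-1.
# Converting the garbled string to Latin-1 yields the raw bytes 0xC3 0xA9,
# which are the correct UTF-8 encoding of "é"; we then mark them as UTF-8.
fixed <- iconv(as.character(stop_words$V1), from = "UTF-8", to = "latin1")
Encoding(fixed) <- "UTF-8"
tail(fixed, 17)
```

This only works if the text was double-encoded exactly once; if it prints new garbage, the corruption happened differently.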

Update:

After installing R, I tried to reproduce the problem.
Here is my console log with a simple explanation:

First, I copy pasted your code:

> stop_words <- read.table("https://gist.githubusercontent.com/alopes/5358189/raw/2107d809cca6b83ce3d8e04dbd9463283025284f/stopwords.txt")
> tail(stop_words, 17)
             V1
206  tivéramos
207       tenha
208    tenhamos
209      tenham
210     tivesse
211 tivéssemos
212    tivessem
213       tiver
214    tivermos
215     tiverem
216       terei
217       terá
218     teremos
219      terão
220       teria
221   teríamos
222      teriam

OK, so it didn't work as-is, so I added the encoding parameter to the read.table call. Here is the result when I tried with lowercase "utf-8":

> stop_words <- read.table("https://gist.githubusercontent.com/alopes/5358189/raw/2107d809cca6b83ce3d8e04dbd9463283025284f/stopwords.txt",encoding="utf-8")
> tail(stop_words, 17)
             V1
206  tivéramos
207       tenha
208    tenhamos
209      tenham
210     tivesse
211 tivéssemos
212    tivessem
213       tiver
214    tivermos
215     tiverem
216       terei
217       terá
218     teremos
219      terão
220       teria
221   teríamos
222      teriam

Finally, I used UTF-8 with capital letters and now it works properly:

> stop_words <- read.table("https://gist.githubusercontent.com/alopes/5358189/raw/2107d809cca6b83ce3d8e04dbd9463283025284f/stopwords.txt", encoding = "UTF-8")
> tail(stop_words, 17)
            V1
206  tivéramos
207      tenha
208   tenhamos
209     tenham
210    tivesse
211 tivéssemos
212   tivessem
213      tiver
214   tivermos
215    tiverem
216      terei
217       terá
218    teremos
219      terão
220      teria
221   teríamos
222     teriam

You might have forgotten to pass the encoding parameter to read.table, or passed it in lowercase instead of uppercase. My understanding is that R tries to re-encode the characters to UTF-8 unless you tell it the input is already UTF-8.
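Note that read.table has two similarly named arguments, and it is easy to mix them up: `fileEncoding` re-encodes the file as it is read, while `encoding` only declares how the strings are marked in R. A sketch of both, using the same gist URL as above:

```r
url <- "https://gist.githubusercontent.com/alopes/5358189/raw/2107d809cca6b83ce3d8e04dbd9463283025284f/stopwords.txt"

# Declare that the input is already UTF-8 (what worked above)
sw1 <- read.table(url, encoding = "UTF-8", stringsAsFactors = FALSE)

# Alternatively, re-encode the file while reading it
sw2 <- read.table(url, fileEncoding = "UTF-8", stringsAsFactors = FALSE)
```

Which one you need depends on whether the bytes in the file are wrong or merely mislabeled; for this gist the file itself is valid UTF-8.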

  • I can see from the chart that you're correct. I'm trying to figure out how to follow your advice. If you know how to do so could you perhaps show me using the linked GitHub text? I see some examples of how to fix double encoding in Python, but not R. – Hack-R Jul 27 '17 at 18:51
  • I might look into this later if no answer is to be found. – Alexandre Mercier Aubin Jul 27 '17 at 19:27

I am Portuguese and I had the same problem, even though my locale is

Sys.getlocale()
[1] "LC_COLLATE=Portuguese_Portugal.1252;LC_CTYPE=Portuguese_Portugal.1252;LC_MONETARY=Portuguese_Portugal.1252;LC_NUMERIC=C;LC_TIME=Portuguese_Portugal.1252"

So I looked it up online and found this tip on SO.

stop_words2 <- sapply(stop_words, as.character)

It worked, but note that I read in the data using `read.table(..., stringsAsFactors = FALSE)`.
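Putting the two steps together, the full call I used was something like this (a sketch of what worked on my machine):

```r
# Read the stopwords without converting strings to factors,
# then coerce each column to plain character vectors
stop_words <- read.table(
  "https://gist.githubusercontent.com/alopes/5358189/raw/2107d809cca6b83ce3d8e04dbd9463283025284f/stopwords.txt",
  stringsAsFactors = FALSE
)
stop_words2 <- sapply(stop_words, as.character)
```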

Rui Barradas
  • Thank you very much. This didn't work for me, but we can keep the answer for future readers who may have the same problem/solution as your case. – Hack-R Jul 27 '17 at 19:04
  • @Hack-R: Maybe it didn't work because of your locale. Can't you change it? – Rui Barradas Jul 27 '17 at 19:13