set encoding for reading text files into tm Corpora

Question

When loading a set of documents with tm's Corpus(), I need to specify the encoding.

All documents are UTF-8 encoded. Opened in a text editor, the content looks fine, but the corpus content is full of strange symbols (indicioâ., ‘sœs....). The source text is in Spanish (es_ES).

library(tm)
cname <- file.path("C:", "Users", "john", "Documents", "texts")
docs <- Corpus(DirSource(cname), encoding ="UTF-8")

> Error in Corpus(DirSource(cname), encoding = "UTF-8") : 
  unused argument (encoding = "UTF-8")

EDITED:

Looking at str(documents[1]) for the corpus, I noticed:

.. ..$ language : chr "en"

How can I specify, for instance "UTF-8", "Latin1" or any other encoding to avoid strange symbols?

Regards

What do you mean by "strange": erroneous symbols, or symbols you want converted to plain text (ASCII) without accents? – Joop Eggen, May 17 '16 at 14:16
The strange symbols seem to be accented characters and the like. Converting to ANSI could work. Latin too. – useRj, May 17 '16 at 14:37
Somewhere else I saw `Encoding(data) <- "UTF-8"`, maybe http://stackoverflow.com/questions/24920396/r-corpus-is-messing-up-my-utf-8-encoded-text – Joop Eggen, May 17 '16 at 14:59

score 0 · Accepted Answer · answered May 18 '16 at 06:54

The "C:" in your path shows that you are on Windows, where R assumes a Windows-1252 encoding (on most systems) rather than UTF-8. You could try reading the files in as character vectors and then setting Encoding(myCharVector) <- "UTF-8". If the input really is UTF-8, this should cause your system to recognise and display the UTF-8 characters properly.
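
As a minimal sketch of that first suggestion, assuming a throwaway sample file (the file name and the sentence in it are made up for illustration), reading with readLines() and then declaring the encoding could look like this. Note that `Encoding<-` only marks the strings as UTF-8; it does not convert any bytes.

```r
# Create a small UTF-8 sample file so the sketch is self-contained.
# The file name and its content are hypothetical.
sample_path <- file.path(tempdir(), "sample_es.txt")
writeBin(charToRaw("El indicio lleg\u00f3 despu\u00e9s.\n"), sample_path)

# Read the text; without an encoding hint the strings are not marked as UTF-8.
myCharVector <- readLines(sample_path, warn = FALSE)

# Declare (not convert) the encoding, as suggested above.
Encoding(myCharVector) <- "UTF-8"

# Check the declared encoding and display the text.
print(Encoding(myCharVector))
cat(myCharVector, sep = "\n")
```

If the files were actually saved in another encoding, marking them as UTF-8 would not help; in that case converting with iconv() from the real source encoding would be the thing to try.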

Alternatively this will work, although it also makes tm unnecessary:

require(quanteda)
docs <- corpus(textfile("C:/Users/john/Documents/texts/*.txt", encoding = "UTF-8"))

Then you can see the texts using for example:

cat(texts(docs)[1:2])

They should have the encoding bit set and display properly. Then if you prefer, you can get these into tm using:

docsTM <- Corpus(VectorSource(texts(docs)))

answered May 18 '16 at 06:54

Ken Benoit

It's hard for me to diagnose exactly without your files, but the basic approach should work. I'm happy to try it with your files if you send a link to a few of them. – Ken Benoit May 19 '16 at 11:04
Hi Ken, an issue arose: `dtm <- DocumentTermMatrix(docsTM)` gives `Error: inherits(doc, "TextDocument") is not TRUE` – useRj May 19 '16 at 11:20
quanteda has an equivalent command, `dfm()`, btw, that works directly on docs – Ken Benoit May 19 '16 at 11:34

score 0 · Answer 2 · answered May 27 '16 at 14:54

It seems there is no need to use the quanteda package (which also showed some odd behaviour, losing file names when converting to tm VCorpus objects). Passing the encoding to DirSource is enough:

library(tm)
# Declare the input encoding on the source, then build the corpus from it
files <- DirSource(directory = "C:/Users/john/Documents/", encoding = "UTF-8")
mycorpus <- VCorpus(x = files)

Now the encoding is correct.
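
A self-contained way to check this, sketched under assumptions: the temporary folder, file name, and sample sentence are made up, and the `pattern` and `readerControl` arguments are optional extras that the answer above does not use. As far as I can tell, the language setting only changes the metadata that str() reports (the "en" noted in the question), while the `encoding` argument of DirSource() is what affects how the bytes are read.

```r
library(tm)

# Hypothetical sample data: one UTF-8 encoded Spanish text file in a fresh folder.
demo_dir <- file.path(tempdir(), "texts_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeBin(charToRaw("El indicio lleg\u00f3 despu\u00e9s.\n"),
         file.path(demo_dir, "doc1.txt"))

# Declare the input encoding on the source and restrict it to .txt files.
files <- DirSource(directory = demo_dir, encoding = "UTF-8", pattern = "\\.txt$")

# Optionally record the document language as Spanish instead of the default "en".
mycorpus <- VCorpus(files, readerControl = list(language = "es"))

# Inspect the first document: the accented characters should be intact.
doc_text <- content(mycorpus[[1]])
doc_lang <- meta(mycorpus[[1]], "language")
cat(doc_text, sep = "\n")
```

Restricting the source with `pattern` may also help when the folder holds files other than the texts, since DirSource() otherwise picks up everything in the directory.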

answered May 27 '16 at 14:54

useRj

I am having this issue. I use `docs <- VCorpus(DirSource(directory = inputdir, encoding = "UTF-8"))`, and my txt files are saved with UTF-8 encoding, but I still get **scal** instead of **fiscal** and **benet** instead of **benefit** in my DTM. Is there something I am doing wrong? – Michael Apr 26 '20 at 23:20
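
One possible cause, offered as a guess rather than something confirmed in this thread: words that lose exactly "fi" often contain the single-character typographic ligature U+FB01, which is common in text extracted from PDFs and can be dropped by later cleaning steps. A minimal base R sketch, with made-up sample words, that expands such ligatures before the corpus is built:

```r
# Hypothetical sample: words containing the ligatures U+FB01 ("fi") and U+FB02 ("fl").
words <- c("\uFB01scal", "bene\uFB01t", "re\uFB02ect")

# Ligature characters and their plain-letter expansions (extend as needed).
from <- c("\uFB01", "\uFB02")
to   <- c("fi", "fl")

# Replace each ligature literally (fixed = TRUE avoids regex interpretation).
fixed_words <- words
for (i in seq_along(from)) {
  fixed_words <- gsub(from[i], to[i], fixed_words, fixed = TRUE)
}

print(fixed_words)
```

If ligatures are not the cause, comparing the raw text of one affected document with its tokens in the DTM should show at which step the letters disappear.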
