When loading a bunch of documents with tm's Corpus, I need to specify the encoding.

All documents are UTF-8 encoded. When opened in a text editor the content looks fine, but the corpus content is full of strange symbols (indicioâ., ‘sœs....). The source text is in Spanish (es_ES).

library(tm)
cname <- file.path("C:", "Users", "john", "Documents", "texts")
docs <- Corpus(DirSource(cname), encoding ="UTF-8")

> Error in Corpus(DirSource(cname), encoding = "UTF-8") : 
  unused argument (encoding = "UTF-8")

EDITED:

Running str(docs[1]) on the corpus, I noticed:

.. ..$ language : chr "en"

How can I specify, for instance "UTF-8", "Latin1" or any other encoding to avoid strange symbols?

Regards

useRj
  • What do you mean by "strange": erroneous symbols, or symbols you want converted to plain text (ASCII) without accents? – Joop Eggen May 17 '16 at 14:16
  • The strange symbols seem to be accented letters and the like. Converting to ANSI could work. Latin1 too. – useRj May 17 '16 at 14:37
  • Somewhere else I saw `Encoding(data) <- "UTF-8"`, maybe http://stackoverflow.com/questions/24920396/r-corpus-is-messing-up-my-utf-8-encoded-text – Joop Eggen May 17 '16 at 14:59
  • I get `language : chr "en"` when I run `str(docs[1])` in the console. – useRj May 17 '16 at 15:58

2 Answers

From the "C:" it's clear you are using Windows, which assumes a Windows-1252 encoding (on most systems) rather than UTF-8. You could try reading the files in as character and then setting Encoding(myCharVector) <- "UTF-8". If the input encoding was UTF-8 this should cause your system to recognise and display the UTF-8 characters properly.
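The first suggestion can be sketched in base R as follows. This is only an illustration under stated assumptions: the temporary file and the names `path` and `txt` are hypothetical stand-ins for one of your documents, not part of the original setup.

```r
# Hypothetical one-line UTF-8 file so the sketch is self-contained.
path <- tempfile(fileext = ".txt")
writeBin(charToRaw(enc2utf8("se\u00f1al")), path)

# readLines() returns the bytes as they are in the file; on a
# non-UTF-8 locale they would be displayed as two garbage characters.
txt <- readLines(path, warn = FALSE)

# Declare that the bytes are UTF-8. Nothing is converted; only the
# encoding mark on the string changes.
Encoding(txt) <- "UTF-8"

print(Encoding(txt))
print(nchar(txt))
```

Passing `encoding = "UTF-8"` to `readLines()` sets the same mark in one step. Note that `Encoding<-` only helps when the bytes really are UTF-8; for files in another encoding, `iconv()` would be needed to convert them.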

Alternatively this will work, although it also makes tm unnecessary:

require(quanteda)
docs <- corpus(textfile("C:/Users/john/Documents/texts/*.txt", encoding = "UTF-8"))

Then you can see the texts using for example:

cat(texts(docs)[1:2])

They should have the encoding bit set and display properly. Then if you prefer, you can get these into tm using:

docsTM <- Corpus(VectorSource(texts(docs)))
Ken Benoit
  • It's hard for me to diagnose exactly without your files, but the basic approach should work. I'm happy to try it with your files if you send a link to a few of them. – Ken Benoit May 19 '16 at 11:04
  • Hi Ken, an issue arose: `dtm <- DocumentTermMatrix(docsTM)` gives `Error: inherits(doc, "TextDocument") is not TRUE` – useRj May 19 '16 at 11:20
  • Quanteda has an equivalent command dfm() btw that works directly on docs – Ken Benoit May 19 '16 at 11:34

It seems there is no need for the quanteda package (which also showed some odd behaviour, losing file names when converting to a tm VCorpus). The encoding is an argument of DirSource, not of Corpus:

files <- DirSource(directory = "C:/Users/john/Documents/", encoding = "UTF-8")
mycorpus <- VCorpus(x = files)

Now the encoding is correct.
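For anyone who wants to check this without touching their own files, here is a minimal self-contained sketch. The temporary directory, the file name `doc1.txt` and the variable `txt_dir` are made up for illustration. It also sets the `language` metadata field through `readerControl`, on the assumption that the `"en"` seen in `str()` is just tm's default language tag and is unrelated to the file encoding.

```r
library(tm)

# Hypothetical one-document directory so the sketch is self-contained.
txt_dir <- file.path(tempdir(), "utf8_texts")
dir.create(txt_dir, showWarnings = FALSE)
writeBin(charToRaw(enc2utf8("indicio se\u00f1al")),
         file.path(txt_dir, "doc1.txt"))

# The encoding is passed to DirSource(); the language tag is a
# separate setting passed through readerControl.
files <- DirSource(directory = txt_dir, encoding = "UTF-8")
mycorpus <- VCorpus(files, readerControl = list(language = "es"))

print(content(mycorpus[[1]]))           # text of the first document
print(meta(mycorpus[[1]], "language"))  # language tag of the document
```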

useRj
  • I am having this issue. I use `docs <- VCorpus(DirSource(directory = inputdir, encoding = "UTF-8"))`, and my txt files are saved with UTF-8 encoding, but I still get **scal** instead of **fiscal** and **benet** instead of **benefit** in my DTM. Is there something I am doing wrong? – Michael Apr 26 '20 at 23:20