
I want to read text files in R. The code used to work, but when I reran it to retest, it failed.

# There are several text files in the folder 'Obama' and in the folder 'Romney'
candidates<-c("Obama","Romney")
pathname<-"C:/txt"
s.dir<-sprintf("%s/%s",pathname,candidates)
article<-Corpus(DirSource(directory=s.dir,encoding="ANSI"))

The error it displays is:

Error in iconv(readLines(x, warn = FALSE), encoding, "UTF-8", "byte") : 
unsupported conversion from 'ANSI' to 'UTF-8' in codepage 936

Also, when I use the code below to try to read a single text file:

m<-"C:/txt/Romney/1.txt"
cc<-Corpus(DirSource(directory=m,encoding="ANSI"))

It displays:

Error in DirSource(directory = m, encoding = "ANSI") : empty directory

The file path exists, so why do I get this error?

user3746295
  • How sure are you this used to work? Have you upgraded `tm` or are on a different computer? It doesn't appear you can use `DirSource` to open a single file like that. Also, are you sure you have the right encoding for your source files? – MrFlick Jul 09 '14 at 01:40
  • @MrFlick This is part of some machine-learning code that classifies a speech as Obama's or Romney's. It used to be able to do the classification. Every time I reopen RStudio, it seems that I have to install the tm package again. Is that what is causing the problem? As for using DirSource to open a single file, that may well be wrong. It isn't part of my classification code; I just wanted to show that the file path exists. So what is the right code to open a single file? – user3746295 Jul 09 '14 at 18:21
  • Well, there was a recent update to `tm` which seems to have changed how some things work (but I'm not an active `tm` user myself, so I'm not familiar with all the details; see [the news page](http://cran.r-project.org/web/packages/tm/news.html)). For a single file, I would try `cc<-Corpus(URISource("file://C:/txt/Romney/1.txt",encoding="ANSI"))` – MrFlick Jul 09 '14 at 18:40
  • @MrFlick So is there any chance that I get the old version of the tm package every time I install it? – user3746295 Jul 09 '14 at 20:00
  • I wouldn't think so. This encoding stuff is new to 0.6 I think. But you can check your `sessionInfo()` to see what's being loaded. – MrFlick Jul 09 '14 at 20:24
  • I have exactly the same problem. My code used to work perfectly but "stuff happened" after updating R studio... – Kasper Christensen Jul 15 '14 at 08:12
  • I know it's not a fix for the bug, but it will at least get you moving. Go to "http://cran.r-project.org/web/packages/tm/index.html", download and install the old version of tm, and wait until it is fixed. – Kasper Christensen Jul 15 '14 at 08:30

3 Answers


Here is what you need to do:

  1. Change article<-Corpus(DirSource(directory=s.dir,encoding="ANSI")) to the following:

article <- VCorpus(DirSource(directory = s.dir), readerControl = list(reader=readPlain))

  2. In the cleanCorpus function, change corpus.tmp <- tm_map(corpus.tmp, tolower) to the following:

corpus.tmp <- tm_map(corpus.tmp, content_transformer(tolower))

Note the use of the content_transformer function, which wraps tolower.

Once both changes are made, the problem should be fixed.
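As a rough illustration of the second change, here is a minimal sketch of what such a cleaning function might look like. The body of the asker's cleanCorpus is not shown in the question, so the function below and the two sample sentences are hypothetical stand-ins; only the calls to tm_map and content_transformer mirror the fix described above.

```r
library(tm)

# Hypothetical stand-in for the files under C:/txt: a tiny in-memory corpus
docs <- c("Hello WORLD, this is a Test.", "Another DOCUMENT here!")
corpus <- VCorpus(VectorSource(docs))

# Hypothetical cleanCorpus(); the asker's real function is not shown
cleanCorpus <- function(corpus) {
  # removePunctuation and stripWhitespace are tm transformations,
  # so they can be passed to tm_map directly
  corpus.tmp <- tm_map(corpus, removePunctuation)
  corpus.tmp <- tm_map(corpus.tmp, stripWhitespace)
  # tolower is a plain base R function, so it is wrapped in
  # content_transformer() to keep each element a tm document
  corpus.tmp <- tm_map(corpus.tmp, content_transformer(tolower))
  corpus.tmp
}

cleaned <- cleanCorpus(corpus)
content(cleaned[[1]])
```

Without the wrapper, tolower returns bare character vectors, which is why later steps that expect tm documents tend to break.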

Ajitesh

Go to "cran.r-project.org/web/packages/tm/index.html", download and install the old version of tm, and wait until the bug is fixed.

Kasper Christensen

s.cor <- Corpus(DirSource(directory = s.dir, encoding = "ANSI"))

I changed encoding="ANSI" to encoding="UTF-8", and it worked:

s.cor <- Corpus(DirSource(directory = s.dir, encoding = "UTF-8"))
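A likely reason this helps: as the error message in the question says, "ANSI" is not an encoding name that iconv can convert from, whereas "UTF-8" is. The sketch below uses only base R to check whether a name is known to iconv and to convert a short string. The helper is_supported_encoding is a hypothetical name introduced here, and the idea that the files might really be in a single-byte encoding such as "latin1" is an assumption, not something the question confirms.

```r
# Hypothetical helper: is this encoding name known to iconv on this system?
is_supported_encoding <- function(enc) {
  toupper(enc) %in% toupper(iconvlist())
}

is_supported_encoding("UTF-8")
is_supported_encoding("ANSI")

# "café" stored as latin1 bytes (0xe9 is "é" in latin1)
x <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))

# Convert to UTF-8, where "é" becomes the two bytes 0xc3 0xa9
y <- iconv(x, from = "latin1", to = "UTF-8")
charToRaw(y)
```

If the text still looks garbled after reading with "UTF-8", it may be worth trying whichever encoding the files were actually saved in.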

Emeka