
I am trying to do some work with the well-known Reuters-21578 dataset and am having trouble loading the .sgm files into my corpus.

Right now I am using the following code:

require(tm)
reut21578 <- system.file("reuters21578", package = "tm")
reuters <- Corpus(DirSource(reut21578),
    readerControl = list(reader = readReut21578XML))

I am doing this in an attempt to include all the files in my corpus, but it gives me the following error:

Error in DirSource(reut21578) : empty directory

Any idea where I may be going wrong?

Ben
user1422508
  • Have a look at this question - it looks like that data is not included with the `tm` package and you may have to manually download before proceeding. http://stackoverflow.com/questions/10377273/tm-package-error-error-definining-document-term-matrix – Stedy Nov 25 '13 at 04:05
  • @Stedy: The link you provided will definitely be helpful for the rest of my analysis but I have already downloaded the data and what I am doing just doesn't seem to be finding the proper directory. – user1422508 Nov 25 '13 at 04:17
  • Ahh, gotcha. OK, what I think is happening is that R is looking in the source code directory for `tm`. Why not simplify things by putting the file in `Documents` or Desktop and just call it as `file("Documents/reuters-21578")` – Stedy Nov 25 '13 at 04:23
  • @Stedy is correct, @user1422508 you should replace `Corpus(DirSource(reut21578)...` with `Corpus(DirSource("full-path-to-dir-with-downloaded-data")...` – Ben Nov 25 '13 at 07:37
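
A minimal sketch of the check suggested in the comments above. One likely explanation for the error is that `system.file()` returns an empty string when the requested path does not exist inside the installed package, so `DirSource()` has nothing to read. The helper `list_corpus_files()` and the placeholder path in `data_dir` are hypothetical names introduced here for illustration; the `Corpus()` call is the one from the question with the directory swapped, and the reader may need adjusting depending on the `tm` version and on how the files were converted.

```r
# system.file() returns "" when nothing matches inside the installed package.
reut21578 <- system.file("reuters21578", package = "tm")
if (!nzchar(reut21578)) {
  message("No 'reuters21578' directory found inside the installed tm package")
}

# Hypothetical helper: fail early, with a clear message, if R cannot see any files.
list_corpus_files <- function(dir) {
  if (!nzchar(dir) || !dir.exists(dir)) {
    stop("Directory not found: '", dir, "'")
  }
  files <- list.files(dir, full.names = TRUE)
  if (length(files) == 0) {
    stop("Directory is empty: ", dir)
  }
  files
}

# Placeholder path (an assumption): replace with the folder holding the downloaded data.
data_dir <- "full-path-to-dir-with-downloaded-data"

# Only build the corpus when the directory exists and tm is installed.
if (dir.exists(data_dir) && requireNamespace("tm", quietly = TRUE)) {
  library(tm)
  print(list_corpus_files(data_dir))
  reuters <- Corpus(DirSource(data_dir),
                    readerControl = list(reader = readReut21578XML))
}
```

Running `list_corpus_files(data_dir)` on its own is a quick way to confirm that the path is the one R actually sees, since relative paths are resolved against `getwd()`.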

1 Answer


The "tm" package includes only a sample of the Reuters21578 data. If you want to avoid downloading, loading and preparing all 22 Reuters21578 files, you can use the package "tm.corpus.Reuters21578":

install.packages("tm.corpus.Reuters21578", repos = "http://datacube.wu.ac.at")
library(tm.corpus.Reuters21578)
data(Reuters21578)
Lenka Vraná
  • There is a comment claiming that the URL doesn't work any more. That is the thing with links, they tend to break. Thus "link only" answers are discouraged ... – GhostCat Aug 14 '17 at 13:00
  • It gives me some warnings, but then the package downloads just fine. I also don't think that this is a true example of a "link only" answer. – Lenka Vraná Sep 03 '17 at 15:19
  • If you could please tell us why we're getting the "Empty Directory" as well, that'd be great, because I converted all SGM files to XML myself, and it's a shame to not get them working. – Shayan Nov 16 '21 at 15:10
  • I tried your answer but I get `ERROR: dependency ‘XML’ is not available for package ‘tm.corpus.Reuters21578’` even though I have installed `libxml2-dev` – Shayan Nov 16 '21 at 20:20
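
A possible workaround for the missing-dependency error in the last comment, offered as a sketch and not as a confirmed fix. When `install.packages()` is given a single `repos` value, dependencies are looked up only in that repository. Passing a vector of repositories lets R look for `XML` elsewhere. The CRAN mirror URL and the wrapper name `install_reuters_corpus()` are assumptions introduced here, and whether a compatible `XML` build exists for a given R version is not something the thread establishes.

```r
# Repositories to search: the one from the answer plus a CRAN mirror (assumption).
repos <- c(datacube = "http://datacube.wu.ac.at",
           CRAN     = "https://cloud.r-project.org")

# Hypothetical wrapper so that nothing is downloaded until it is called explicitly.
install_reuters_corpus <- function() {
  install.packages("tm.corpus.Reuters21578", repos = repos)
}
```

Calling `install_reuters_corpus()` and then `library(tm.corpus.Reuters21578)` followed by `data(Reuters21578)` would reproduce the steps from the answer, with dependencies resolved from both repositories.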