I have a (small) problem with the tm library for R. Say I have a corpus:

# boilerplate
library(tm)
bcorp <- c("one","two","three","four","five")
myCorpus <- Corpus(VectorSource(bcorp), list(language = "en_US"))
tdm <- TermDocumentMatrix(myCorpus)
Docs(tdm)

Result:

[1] "1" "2" "3" "4" "5"

This works. But when I try to apply a transformation with tm_map():

# this does not work
myCorpus <- Corpus(VectorSource(bcorp), list(language = "en_US"))
myCorpus <- tm_map(myCorpus, tolower)
tdm <- TermDocumentMatrix(myCorpus)

This gives:

Error: inherits(doc, "TextDocument") is not TRUE

The solution proposed for this error was to convert the documents to PlainTextDocument.

# this works but erases the metadata
myCorpus <- Corpus(VectorSource(bcorp), list(language = "en_US"))
myCorpus <- tm_map(myCorpus, tolower)
myCorpus <- tm_map(myCorpus, PlainTextDocument)
tdm <- TermDocumentMatrix(myCorpus)
Docs(tdm)

Result:

[1] "character(0)" "character(0)" "character(0)" "character(0)" "character(0)"

Now it works, but it erases all the metadata (in this case, the document names). Is there a way to maintain the metadata, or to save and then restore it?

momobo
  • I thought that myself, but I did not find it in the help files for VectorSource(), Corpus(), or tm_map(). – momobo Sep 03 '14 at 08:03
  • Upon calling `TermDocumentMatrix`, I get `Error in UseMethod("meta", x) : ` – Rich Scriven Sep 03 '14 at 08:12
  • I'm interested to know whether reusing the same name `myCorpus` in successive assignments may have changed your data's attributes, because `inherits` is a function that checks attributes. – Rich Scriven Sep 03 '14 at 08:18
  • Thank you Richard. I found a solution myself. – momobo Sep 03 '14 at 08:27
  • Possible duplicate of [DocumentTermMatrix error on Corpus argument](http://stackoverflow.com/questions/24191728/documenttermmatrix-error-on-corpus-argument) – Hardik Gupta Jan 12 '17 at 12:32

1 Answer

I found it.

The line:

myCorpus <- tm_map(myCorpus, PlainTextDocument)

solves the problem but erases the metadata.

I found this answer that explains a better way to use tm_map(). I just have to replace:

myCorpus <- tm_map(myCorpus, tolower)

with:

myCorpus <- tm_map(myCorpus, content_transformer(tolower))

And everything works!
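For completeness, here is a minimal end-to-end sketch of the fix. It assumes a tm version that provides content_transformer() (0.6 or later), and it uses VCorpus() explicitly instead of Corpus() so that each document is a PlainTextDocument; that choice, the capitalized sample words, and the variable names `wrapped` and `doc` are mine for illustration and are not part of the original question.

```r
library(tm)

# Sample data (capitalized here so that lowercasing has a visible effect)
bcorp <- c("One", "Two", "Three", "Four", "Five")

# Assumption: build a VCorpus explicitly so every element is a PlainTextDocument
myCorpus <- VCorpus(VectorSource(bcorp))

# content_transformer() wraps a plain character function so that it only
# rewrites the content of a document and returns the document object itself
wrapped <- content_transformer(tolower)
doc <- wrapped(myCorpus[[1]])

inherits(doc, "TextDocument")  # the check that TermDocumentMatrix() relies on
meta(doc, "id")                # the document id is still attached

# Apply the wrapped function to the whole corpus, then build the matrix
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
tdm <- TermDocumentMatrix(myCorpus)
Docs(tdm)
```

The point of the wrapper is that the function passed to tm_map() must return a document, not a bare character vector; content_transformer() takes care of that, so no later conversion with PlainTextDocument is needed and the ids are not regenerated.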

momobo