
I have the following code:

# returns string w/o leading or trailing whitespace
trim <- function (x) gsub("^\\s+|\\s+$", "", x)

news_corpus <- Corpus(VectorSource(news_raw$text)) # a column of strings.

corpus_clean <- tm_map(news_corpus, tolower)
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords('english'))
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
corpus_clean <- tm_map(corpus_clean, trim)

news_dtm <- DocumentTermMatrix(corpus_clean) # errors here

When I run the DocumentTermMatrix() method, it gives me this error:

Error: inherits(doc, "TextDocument") is not TRUE

Why do I get this error? Are my rows not text documents?

Here is the output upon inspecting corpus_clean:

[[153]]
[1] obama holds technical school model us

[[154]]
[1] oil boom produces jobs bonanza archaeologists

[[155]]
[1] islamic terrorist group expands territory captures tikrit

[[156]]
[1] republicans democrats feel eric cantors loss

[[157]]
[1] tea party candidates try build cantor loss

[[158]]
[1] vehicles materials stored delaware bridges

[[159]]
[1] hill testimony hagel defends bergdahl trade

[[160]]
[1] tweet selfpropagates tweetdeck

[[161]]
[1] blackwater guards face trial iraq shootings

[[162]]
[1] calif man among soldiers killed afghanistan

[[163]]
[1] stocks fall back world bank cuts growth outlook

[[164]]
[1] jabhat alnusra longer useful turkey

[[165]]
[1] catholic bishops keep focus abortion marriage

[[166]]
[1] barbra streisand visits hill heart disease

[[167]]
[1] rand paul cantors loss reason stop talking immigration

[[168]]
[1] israeli airstrike kills northern gaza

Edit: Here is my data:

type,text
neutral,The week in 32 photos
neutral,Look at me! 22 selfies of the week
neutral,Inside rebel tunnels in Homs
neutral,Voices from Ukraine
neutral,Water dries up ahead of World Cup
positive,Who's your hero? Nominate them
neutral,Anderson Cooper: Here's how
positive,"At fire scene, she rescues the pet"
neutral,Hunger in the land of plenty
positive,Helping women escape 'the life'
neutral,A tour of the sex underworld
neutral,Miss Universe Thailand steps down
neutral,China's 'naked officials' crackdown
negative,More held over Pakistan stoning
neutral,Watch landmark Cold War series
neutral,In photos: History of the Cold War
neutral,Turtle predicts World Cup winner
neutral,What devoured great white?
positive,Nun wins Italy's 'The Voice'
neutral,Bride Price app sparks debate
neutral,China to deport 'pork' artist
negative,Lightning hits moving car
neutral,Singer won't be silenced
neutral,Poland's mini desert
neutral,When monarchs retire
negative,Murder on Street View?
positive,Meet armless table tennis champ
neutral,Incredible 400 year-old globes
positive,Man saves falling baby
neutral,World's most controversial foods

Which I retrieve like:

news_raw <- read.csv('news_csv.csv', stringsAsFactors = FALSE)

Edit: Here is the traceback():

> news_dtm <- DocumentTermMatrix(corpus_clean)
Error: inherits(doc, "TextDocument") is not TRUE
> traceback()
9: stop(sprintf(ngettext(length(r), "%s is not TRUE", "%s are not all TRUE"), 
       ch), call. = FALSE, domain = NA)
8: stopifnot(inherits(doc, "TextDocument"), is.list(control))
7: FUN(X[[1L]], ...)
6: lapply(X, FUN, ...)
5: mclapply(unname(content(x)), termFreq, control)
4: TermDocumentMatrix.VCorpus(x, control)
3: TermDocumentMatrix(x, control)
2: t(TermDocumentMatrix(x, control))
1: DocumentTermMatrix(corpus_clean)

When I evaluate inherits(corpus_clean, "TextDocument") it is FALSE.

MrFlick
user1477388
  • If I use `data(crude); news_corpus <- crude;` and then run all your transformations, I do not get the error. What exactly does `news_raw$text` look like? What class is it? – MrFlick Jun 12 '14 at 19:09
  • It is a character class. That doesn't sound right - how can I change it? – user1477388 Jun 12 '14 at 19:11
  • Actually "character" is correct. That's just what R calls it. Other languages may call them strings. But as it stands, I can't reproduce your problem without data. Can you provide a minimal, reproducible example that I can run to get the same error? – MrFlick Jun 12 '14 at 19:12
  • Absolutely, please see my edit. (I am trying to build a type of "sentiment analysis" program.) – user1477388 Jun 12 '14 at 19:14
  • Thank you for editing and posting data; however, when I run the same commands on the new data I do not get an error. Everything works fine. I'm wondering if you accidentally replaced/shadowed a function somewhere along the way. I would close and restart R to see if it works then. Or, before closing, check `conflicts()` to see if something looks odd. – MrFlick Jun 12 '14 at 19:19
  • I closed and re-opened. conflicts() reads ""data" "body<-" "kronecker"." Is there anything that is known to cause that error? I google'd it but couldn't find anything. Maybe there is some strange character somewhere in my data that's throwing it off? – user1477388 Jun 12 '14 at 19:22
  • Do you still get the error? I suppose adding the results of `traceback()` should hopefully identify the (sub)function where the error is occurring. Just run that command after you get the error. – MrFlick Jun 12 '14 at 19:35
  • That's interesting, thanks. I updated the question with the traceback(). I don't really know what it is saying. – user1477388 Jun 12 '14 at 19:37
  • The traceback says the error is ultimately occurring in the `termFreq` function. How about one more thing. What does `table(sapply(corpus_clean, class))` return? And have you tried without the `trim` step? – MrFlick Jun 12 '14 at 20:05
  • The output of `table(sapply(corpus_clean, class))` is `character 168`. I have tried without `trim` and it doesn't work. Actually, I added `trim` because I thought the leading whitespaces were the problem (seems not). – user1477388 Jun 13 '14 at 12:55
  • Well, that is a problem. You're really running the code exactly as above? The `Corpus(VectorSource(news_raw$text))` should convert everything to a plain text document. When I run the `sapply( ,class)` I get `character, PlainTextDocument, TextDocument`. – MrFlick Jun 13 '14 at 13:04
  • Yes, I am running it exactly as I have posted it here. Any idea why it isn't converting my `news_raw$text` to plain text documents? Come to think of it, when I look at the line `positive,Who's your hero? Nominate them` it's not shown in double quotes (so as to escape the single quote). Could that be the problem? – user1477388 Jun 13 '14 at 13:06
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/55570/discussion-between-mrflick-and-user1477388). – MrFlick Jun 13 '14 at 13:10
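
The class check discussed in the comments above can be reproduced on a small corpus. This is a minimal sketch, assuming tm 0.6.0 or later is installed; the two sample headlines and the names `docs`, `corpus_is_doc` and `element_is_doc` are placeholders, not the asker's data. It also illustrates why `inherits(corpus_clean, "TextDocument")` is `FALSE` even for a healthy corpus: the check has to be made on the individual documents, not on the corpus object.

```r
library(tm)

# Hypothetical two-document corpus; VCorpus keeps each element as a document object.
docs <- VCorpus(VectorSource(c("Voices from Ukraine", "Man saves falling baby")))

# The corpus itself is a container, so this is FALSE even when nothing is wrong.
corpus_is_doc <- inherits(docs, "TextDocument")

# Each element should be a TextDocument. If a transformation has turned the
# elements into plain character vectors, this is FALSE for those elements.
element_is_doc <- vapply(content(docs), inherits, logical(1), what = "TextDocument")

print(corpus_is_doc)
print(element_is_doc)
```

If any entry of `element_is_doc` is `FALSE`, `DocumentTermMatrix()` can be expected to fail with the error shown in the question.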

4 Answers


It seems this would have worked just fine in tm 0.5.10, but changes in tm 0.6.0 seem to have broken it. The problem is that the functions tolower and trim won't necessarily return TextDocuments (it looks like the older version may have done the conversion automatically). They return character vectors instead, and DocumentTermMatrix isn't sure how to handle a corpus of characters.

So you could change to

corpus_clean <- tm_map(news_corpus, content_transformer(tolower))

Or you can run

corpus_clean <- tm_map(corpus_clean, PlainTextDocument)

after all of your non-standard transformations (those not in getTransformations()) are done and just before you create the DocumentTermMatrix. That should make sure all of your data is in PlainTextDocument and should make DocumentTermMatrix happy.
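
Putting the first option together, the pipeline from the question might look like the sketch below. It assumes tm 0.6.0 or later; `headlines` is a hypothetical stand-in for `news_raw$text`, and `VCorpus` is used instead of `Corpus` because newer tm releases may return a different corpus type from `Corpus()`. That last choice is an assumption of this sketch, not part of the original code.

```r
library(tm)

# Returns a string without leading or trailing whitespace (as in the question).
trim <- function(x) gsub("^\\s+|\\s+$", "", x)

# Hypothetical stand-in for news_raw$text.
headlines <- c("The week in 32 photos",
               "Inside rebel tunnels in Homs",
               "Man saves falling baby")

news_corpus <- VCorpus(VectorSource(headlines))

# Plain character functions are wrapped so each result stays a TextDocument.
corpus_clean <- tm_map(news_corpus, content_transformer(tolower))

# Transformations that ship with tm can be passed directly.
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords("english"))
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)

# The custom trim helper needs the same wrapper as tolower.
corpus_clean <- tm_map(corpus_clean, content_transformer(trim))

news_dtm <- DocumentTermMatrix(corpus_clean)
print(dim(news_dtm))
```

Only `tolower` and `trim` are wrapped here, since those are the two functions identified above as returning characters.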

MrFlick
  • Why do the authors of tm keep breaking backward-compatibility? This just happened again less than a month ago with readerControl – wordsforthewise Dec 17 '17 at 04:05

I have found a way to solve this problem in an article about TM.

An example that produces the error follows:

getwd()
require(tm)
files <- DirSource(directory="texts/", encoding="latin1") # import files
corpus <- VCorpus(x=files) # load files, create corpus

summary(corpus) # get a summary
corpus <- tm_map(corpus,removePunctuation)
corpus <- tm_map(corpus,stripWhitespace)
corpus <- tm_map(corpus,removePunctuation);
matrix_terms <- DocumentTermMatrix(corpus)

Warning messages:

In TermDocumentMatrix.VCorpus(x, control) : invalid document identifiers

This error occurs because building the document-term matrix requires a corpus created from a VectorSource, but the preceding transformations turn the texts in your corpus into plain characters, a class the function does not accept.

However, if you wrap each function in content_transformer inside the tm_map call, you may not even need an extra command before calling TermDocumentMatrix.

The code below changes the class (see the second-to-last line) and avoids the error:

getwd()
require(tm)
files <- DirSource(directory="texts/", encoding="latin1")
corpus <- VCorpus(x=files) # load files, create corpus

summary(corpus) # get a summary
corpus <- tm_map(corpus,content_transformer(removePunctuation))
corpus <- tm_map(corpus,content_transformer(stripWhitespace))
corpus <- tm_map(corpus,content_transformer(removePunctuation))
corpus <- Corpus(VectorSource(corpus)) # change class 
matrix_term <- DocumentTermMatrix(corpus)
hongsy

Change this:

corpus_clean <- tm_map(news_corpus, tolower)

To this:

corpus_clean <- tm_map(news_corpus, content_transformer(tolower))
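
To see why the wrapper matters, the two calls can be compared on a single document. This is a rough sketch assuming tm 0.6.0 or later; `doc`, `bare` and `wrapped` are illustrative names, and the headline is taken from the sample data in the question.

```r
library(tm)

# Hypothetical single document built from one of the sample headlines.
doc <- PlainTextDocument("Nun wins Italy's 'The Voice'")

# Calling tolower directly yields a bare character vector; the document class is lost.
bare <- tolower(doc)

# The wrapped version lowercases the content and hands back the document object.
wrapped <- content_transformer(tolower)(doc)

print(class(bare))
print(inherits(wrapped, "TextDocument"))
print(content(wrapped))
```

The same wrapping applies to any function that takes and returns character vectors, such as the `trim` helper in the question.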
Nizam
Renmelcon

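This should work. It downgrades tm to version 0.5-10 (the URL below points to a Windows binary built for R 3.0):

remove.packages("tm") # the package name must be a quoted string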
install.packages("http://cran.r-project.org/bin/windows/contrib/3.0/tm_0.5-10.zip",repos=NULL)
library(tm)
Rich Scriven
gopal