R tm package vcorpus: Error in converting corpus to data frame

Question

I am using the tm package to clean up some data using the following code:

mycorpus <- Corpus(VectorSource(x))
mycorpus <- tm_map(mycorpus, removePunctuation)

I then want to convert the corpus back into a data frame in order to export a text file that contains the data in the original format of a data frame. I have tried the following:

dataframe <- as.data.frame(mycorpus)

But this returns an error:

Error in as.data.frame.default(mycorpus) : cannot coerce class "c("VCorpus", "Corpus")" to a data.frame

How can I convert a corpus into a data frame?

`library(qdap); as.data.frame(mycorpus)` may be of use. – Tyler Rinker Jul 11 '14 at 19:08

score 25 · Accepted Answer · edited Jan 12 '18 at 01:09

Your corpus is really just a character vector with some extra attributes. So it's best to convert it to character, then you can save that to a data.frame like so:

library(tm)
x <- c("Hello. Sir!","Tacos? On Tuesday?!?")
mycorpus <- Corpus(VectorSource(x))
mycorpus <- tm_map(mycorpus, removePunctuation)

dataframe <- data.frame(text = unlist(sapply(mycorpus, `[`, "content")),
    stringsAsFactors = FALSE)

which returns

              text
1        Hello Sir
2 Tacos On Tuesday

UPDATE: With newer versions of tm, they seem to have updated the as.list.SimpleCorpus method, which really messes with using sapply and lapply. Now I guess you'd have to use

dataframe <- data.frame(text = sapply(mycorpus, identity),
    stringsAsFactors = FALSE)
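
Since the right call depends on the tm version and on whether you have a SimpleCorpus or a VCorpus, one way to sidestep the difference is to call as.character() on each document. This is only a sketch: corpus_to_df is a hypothetical helper name (not part of tm), and joining multi-line documents with a newline is my own assumption. The last lines write the result to a tab-separated text file, which is what the question was ultimately after.

```r
library(tm)

# Hypothetical helper (not part of tm): one row per document.
corpus_to_df <- function(corpus) {
  # as.character() works on plain strings (SimpleCorpus elements)
  # and on PlainTextDocument objects (VCorpus elements).
  # Multi-line documents are joined with "\n" (an assumption).
  txt <- vapply(
    corpus,
    function(doc) paste(as.character(doc), collapse = "\n"),
    character(1)
  )
  data.frame(text = unname(txt), stringsAsFactors = FALSE)
}

x <- c("Hello. Sir!", "Tacos? On Tuesday?!?")
simple   <- tm_map(Corpus(VectorSource(x)), removePunctuation)
volatile <- tm_map(VCorpus(VectorSource(x)), removePunctuation)

df_simple   <- corpus_to_df(simple)
df_volatile <- corpus_to_df(volatile)

# Export the cleaned text to a plain text file.
out <- tempfile(fileext = ".txt")
write.table(df_simple, out, sep = "\t", row.names = FALSE, quote = FALSE)
```

Both calls should give the same one-column data frame, so the rest of your script does not need to care which corpus class it was handed.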

thanks! I see the return as a data.frame that has a list with summary data included in the first instance? (1 list(list(content = "Hello Sir", meta = list(author...) — lmcshane, Jul 11 '14 at 18:54

score 5 · Answer 2 · answered Mar 06 '17 at 18:45

The Corpus-classed object has a content element accessible through get:

library("tm")

x <- c("Hello. Sir!","Tacos? On Tuesday?!?")
mycorpus <- Corpus(VectorSource(x))
mycorpus <- tm_map(mycorpus, removePunctuation)

attributes(mycorpus)
# $names
# [1] "content" "meta"    "dmeta"  
# 
# $class
# [1] "SimpleCorpus" "Corpus"      
# 

df <- data.frame(text = get("content", mycorpus))

head(df)
#               text
# 1        Hello Sir
# 2 Tacos On Tuesday

score 3 · Answer 3 · edited Nov 16 '14 at 19:49

The older answer posted by MrFlick works only with previous versions of tm. I was able to fix it by removing "content" from the call.

dataframe <- data.frame(text = unlist(sapply(mycorpus, `[`)), stringsAsFactors = FALSE)

answered Nov 16 '14 at 19:23 by user4258767 · edited Nov 16 '14 at 19:49 by Adi Inbar

Odd, I am using tm version 0.6 (on CRAN currently) and Flick's answer works for me. – Tyler Rinker Nov 17 '14 at 20:17
Hello - I get this error in the conversion. Any Idea why? > data.frame(text=unlist(sapply(ccorpus_clean, `[`, "content")), stringsAsFactors=F) Error in UseMethod("meta", x) : no applicable method for 'meta' applied to an object of class "try-error" – myloginid Jul 30 '15 at 05:12

score 3 · Answer 4 · answered Aug 27 '17 at 13:25

You can convert to data.frame, sort the most frequent words and plot in a wordcloud!

library(tm)
library("wordcloud")
library("RColorBrewer")

x <- c("Hello. Sir!","Tacos? On Tuesday?!?", "Hello")
mycorpus <- Corpus(VectorSource(x))
mycorpus <- tm_map(mycorpus, removePunctuation)

dtm <- TermDocumentMatrix(mycorpus)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)

#           word freq
#hello     hello    2
#sir         sir    1
#tacos     tacos    1
#tuesday tuesday    1

#plot in a wordcloud
set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

score 0 · Answer 5 · answered Mar 04 '16 at 20:07

This is an alternative approach I've used in my own work with text analytics. Essentially, you refer to your document term matrix as a matrix when converting it into a data frame - after which you can run an additional line that makes your variable names R-friendly.

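# Build the document term matrix first, then convert it via a plain matrix
dtm <- DocumentTermMatrix(mycorpus)
database <- as.data.frame(as.matrix(dtm))

# Make the term column names syntactically valid R names
colnames(database) <- make.names(colnames(database))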

I'm not sure how (or if) this approach differs from the other answers in terms of output but I find this syntax much more straightforward and simpler to implement. Hope this helps!
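
On how the output differs: going through a document term matrix gives you term counts (one row per document, one column per term) rather than the cleaned text itself. A small sketch to see the shape, assuming the same toy input as the other answers and tm's default DocumentTermMatrix settings; the variable name counts is just for illustration.

```r
library(tm)

x <- c("Hello. Sir!", "Tacos? On Tuesday?!?")
mycorpus <- tm_map(Corpus(VectorSource(x)), removePunctuation)

# Rows are documents, columns are terms, cells are counts.
dtm <- DocumentTermMatrix(mycorpus)
counts <- as.data.frame(as.matrix(dtm))
colnames(counts) <- make.names(colnames(counts))

dim(counts)       # documents x terms
colnames(counts)  # the terms tm kept
```

So this is the right tool if you want word frequencies per document, but not if you want the original text back in a data frame.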

score 0 · Answer 6 · answered Dec 15 '20 at 03:07

There is now a package called textreg which has a nice function for this:

library(textreg)
df <- data.frame(text = convert.tm.to.character(mycorpus))

answered Dec 15 '20 at 03:07 by wordsforthewise
