In R tm package, build corpus FROM Document-Term-Matrix

Question

It's straightforward to build a document-term matrix from a corpus with the tm package. I'd like to build a corpus from a document-term-matrix.

Let M be the number of documents in a document set. Let V be the number of terms in the vocabulary of that document set. Then a document-term-matrix is an M*V matrix.

I also have a vocabulary vector, of length V. In the vocabulary vector are the words represented by indices in the document-term-matrix.

From the dtm and vocabulary vector, I'd like to construct a "corpus" object. This is because I'd like to stem my document set. I built my dtm and vocab manually - i.e. there never was a tm "corpus" object representing my dataset, so I can't use the function,

tm_map(corpus, stemDocument, language="english")

I've been trying to build a workaround where I stem the vocabulary and only keep unique words, but then it gets somewhat complicated trying to maintain the correspondence between the dtm and the vocabulary vector.

Ideally, the end result would be that my vocabulary vector is stemmed and only contains unique entries, and the dtm indices correspond to the stemmed vocabulary vector. If you can think of some other way to do that, I would appreciate that as well.

My troubles would be fixed if I could simply build a tm "corpus" from my dtm and vocabulary vector, stem the corpus, and then convert back to a dtm and vocabulary vector (I already know how to make those conversions).

Let me know if I can clarify the problem any further.

You have not provided a [minimal, reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) so it's tough to offer specific help. Ideally put a sample object together that is representative of your data and we test different solutions to transform it. It seems unlikely that you would have to go back to a corpus given that the stemming functions should work on any vector of character values. — MrFlick, Jun 25 '14 at 21:45
Thanks, @MrFlick. Duly noted that I should always provide a minimal, reproducible example. My workaround with stemming the vocabulary vector is messy, but I will post a MRE and update ASAP. — sinwav, Jun 25 '14 at 22:47
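
As the comment above suggests, the stemmer can be applied to the vocabulary vector directly, with no corpus involved. The following is only a minimal sketch of that idea: the toy `vocab` and `dtm` objects are hypothetical stand-ins for the question's data, and it assumes the SnowballC package is installed and that the matrix holds plain counts.

```r
library(SnowballC)  # provides wordStem()

## Hypothetical toy data standing in for the question's objects:
## a 2 x 4 count matrix (documents x terms) and its vocabulary vector
vocab <- c("run", "running", "runs", "oil")
dtm   <- matrix(c(1, 2, 0, 3,
                  0, 1, 4, 0),
                nrow = 2, byrow = TRUE)

## Stem each vocabulary entry; several words may map to the same stem
stems <- wordStem(vocab, language = "english")

## Sum the columns that share a stem. rowsum() groups rows, so transpose,
## group by stem, and transpose back to documents x stems
dtm_stemmed   <- t(rowsum(t(dtm), group = stems))
vocab_stemmed <- colnames(dtm_stemmed)
```

Because the column names of `dtm_stemmed` are the unique stems, the stemmed vocabulary and the matrix columns stay aligned by construction. For a large sparse matrix a dense `t()` may be too expensive, so this sketch is best read as an illustration for small inputs.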

score 5 · Accepted Answer · edited May 23 '17 at 12:09

Here's one approach, using my own minimal reproducible example built from the crude data set in the tm package (as a new user you may not be aware that providing one is your responsibility):

## Minimal Reproducible Example
library(tm)
data("crude")
## Use plain term counts (the default weighting): rep() below needs integer
## counts, and fractional tf-idf weights would be truncated
dtm <- DocumentTermMatrix(crude, control = list(stopwords = TRUE))

## Convert dtm to a character vector of texts, one string per document
dtm2list <- apply(dtm, 1, function(x) {
    paste(rep(names(x), x), collapse=" ")
})

## convert to a Corpus
myCorp <- VCorpus(VectorSource(dtm2list))
inspect(myCorp)

## Stemming
myCorp <- tm_map(myCorp, stemDocument)
inspect(myCorp)
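
To connect this to the question's setup (a plain matrix plus a separate vocabulary vector), the same steps can be sketched as below, including the conversion back to a matrix and vocabulary. The toy `vocab` and `dtm` objects are hypothetical stand-ins for the real data, and the sketch assumes integer counts and that SnowballC is installed for `stemDocument`.

```r
library(tm)

## Hypothetical stand-ins for the question's manually built objects
vocab <- c("run", "running", "runs", "oil")
dtm   <- matrix(c(1, 2, 0, 3,
                  0, 1, 4, 0),
                nrow = 2, byrow = TRUE)

## Rebuild one pseudo-document per row by repeating each word by its count
docs <- apply(dtm, 1, function(x) paste(rep(vocab, x), collapse = " "))

## Corpus -> stem -> back to a document-term matrix
corp <- VCorpus(VectorSource(docs))
corp <- tm_map(corp, stemDocument)
dtm_stemmed   <- as.matrix(DocumentTermMatrix(corp))
vocab_stemmed <- colnames(dtm_stemmed)
```

Note that `DocumentTermMatrix` by default drops terms shorter than three characters; passing `control = list(wordLengths = c(1, Inf))` should keep them if short stems matter. Word order within each pseudo-document is arbitrary, which is fine for a bag-of-words representation.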

answered Jun 25 '14 at 21:47

Tyler Rinker

Thanks, Tyler. It looks like this will work. I will double check it on my project and get back to you. Also, I see now that I should always provide a minimal, reproducible example. Will do that in the future and update this post with one if I need more help. – sinwav Jun 25 '14 at 22:48
This did it. Thanks for the help and for notifying me of this expectation on SO. – sinwav Jun 26 '14 at 19:58
With this method, the metadata is lost. How can we preserve it? – Jan 30 '16 at 09:34
@CeylanB. How did you store metadata in a DocumentTermMatrix? As far as I know it does not store metadata, so there is nothing to preserve. This question was about going from a DocumentTermMatrix to a Corpus. The DocumentTermMatrix is a simple triplet matrix which doesn't store metadata, to my knowledge. – Tyler Rinker Jan 30 '16 at 13:21
@Tyler Rinker You are right, but I mean converting a Corpus to a Document Term Matrix and then back to a Corpus. – Jan 30 '16 at 21:15
That original corpus was for demo purposes. I'm not understanding why you'd want to go from corpus to dtm and back to corpus. – Tyler Rinker Jan 30 '16 at 21:17
