
I'm producing a DocumentTermMatrix from a corpus using tm, keeping only terms that occur quite frequently (i.e. MinDocFrequency=50).

Now I want to produce a DTM with a different corpus, but counting exactly the same terms as the previous one, no extra and no fewer. (to cross-validate)

If I use the same method to produce the DTM as with the first corpus, I end up with more terms, fewer terms, or simply different ones, because their frequencies differ from those in the original corpus.

How can I go about doing this? I need to specify which terms to count somehow, but I don't know how.

Thanks to anyone who can point me in the right direction,

-N

EDIT: I was asked for a reproducible example, so I've pasted some example code here: http://pastebin.com/y3FDHbYS

Re-edit:

require(tm)
text <- c('saying text is good',
          'saying text once and saying text twice is better',
          'saying text text text is best',
          'saying text once is still ok',
          'not saying it at all is bad',
          'because text is a good thing',
          'we all like text',
          'even though sometimes it is missing')

validationText <- c("This has different words in it.",
                    "But I still want to count",
                    "the occurence of text",
                    "for example")

TextCorpus <- Corpus(VectorSource(text))
ValiTextCorpus <- Corpus(VectorSource(validationText))

Control = list(stopwords=TRUE, removePunctuation=TRUE, removeNumbers=TRUE, MinDocFrequency=5)

TextDTM = DocumentTermMatrix(TextCorpus, Control)
ValiTextDTM = DocumentTermMatrix(ValiTextCorpus, Control)

This, however, just shows the method I'm already familiar with for producing a DTM, and as a result the two DTMs (TextDTM and ValiTextDTM) contain different terms. What I'm trying to achieve is counting the same terms in both corpora, even if they are much less frequent in the validation one. In the example, then, I'd be trying to count occurrences of the word "text", even though this would produce a very sparse matrix in the validation case.

IRTFM
N. McA.
  • Hi there! Please make your post reproducible by having a look at [**How to make a great reproducible example**](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for us to help you. Thank you. – Arun Mar 23 '13 at 16:56
  • Also, please do show us what you've done so far. – Arun Mar 23 '13 at 16:56
  • Ok, I've produced an example (sort of), see my edit and paste here http://pastebin.com/y3FDHbYS – N. McA. Mar 23 '13 at 17:14

1 Answer


Here's one approach. Does it work for your data? See further down for details that use the OP's data.

# load text mining library
library(tm)

# make first corpus for text mining (data comes from package, for reproducibility)
# crude is already a corpus, so it can be subset directly
data("crude")
corpus1 <- crude[1:10]

# process text (your methods may differ)
# content_transformer() is needed around base functions such as tolower in tm >= 0.6
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(content_transformer(tolower), removePunctuation, removeNumbers,
              stripWhitespace, skipWords)
crude1 <- tm_map(corpus1, FUN = tm_reduce, tmFuns = funcs)
crude1.dtm <- TermDocumentMatrix(crude1, control = list(wordLengths = c(3,10)))

# high-frequency terms in the first corpus
# (the cutoff of 20 is an assumption; pick whatever threshold suits your data)
crude1.dtm.freq <- findFreqTerms(crude1.dtm, lowfreq = 20)

# prepare 2nd corpus
corpus2 <- crude[11:20]

# process text as above, reusing funcs
crude2 <- tm_map(corpus2, FUN = tm_reduce, tmFuns = funcs)
crude2.dtm <- TermDocumentMatrix(crude2, control = list(wordLengths = c(3,10)))

crude2.dtm.mat <- as.matrix(crude2.dtm)

# subset second corpus by words in first corpus
crude2.dtm.mat[rownames(crude2.dtm.mat) %in% crude1.dtm.freq, ]

# Illustrative output from the original post; it was produced from crude[1:10],
# so your document names and counts will differ
        Docs
Terms    reut-00001.xml reut-00002.xml reut-00004.xml reut-00005.xml reut-00006.xml
  oil                 5             12              2              1              1
  opec                0             15              0              0              0
  prices              3              5              0              0              0
        Docs
Terms    reut-00007.xml reut-00008.xml reut-00009.xml reut-00010.xml reut-00011.xml
  oil                 7              4              3              5              9
  opec                8              1              2              2              6
  prices              5              1              2              1              9

UPDATE, after data was provided and comments were added: I think this is a bit closer to your question.

Here's the same process using document term matrices (DTMs) instead of the term document matrices (TDMs) I used above; it's only a slight variation:

# load text mining library
library(tm)

# make corpus for text mining (data comes from package, for reproducibility)
# crude is already a corpus, so it can be subset directly
data("crude")
corpus1 <- crude[1:10]

# process text (your methods may differ)
# content_transformer() is needed around base functions such as tolower in tm >= 0.6
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(content_transformer(tolower), removePunctuation, removeNumbers,
              stripWhitespace, skipWords)
crude1 <- tm_map(corpus1, FUN = tm_reduce, tmFuns = funcs)
crude1.dtm <- DocumentTermMatrix(crude1, control = list(wordLengths = c(3,10)))

# high-frequency terms in the first corpus
# (the cutoff of 20 is an assumption; pick whatever threshold suits your data)
crude1.dtm.freq <- findFreqTerms(crude1.dtm, lowfreq = 20)

# second corpus, processed the same way
corpus2 <- crude[11:20]
crude2 <- tm_map(corpus2, FUN = tm_reduce, tmFuns = funcs)
crude2.dtm <- DocumentTermMatrix(crude2, control = list(wordLengths = c(3,10)))

# keep only the columns (terms) that were frequent in the first corpus
crude2.dtm.mat <- as.matrix(crude2.dtm)
crude2.dtm.mat[, colnames(crude2.dtm.mat) %in% crude1.dtm.freq]

# Illustrative output from the original post; it was produced from crude[1:10],
# so your document names and counts will differ

Terms
Docs             oil opec prices
reut-00001.xml   5    0      3
reut-00002.xml  12   15      5
reut-00004.xml   2    0      0
reut-00005.xml   1    0      0
reut-00006.xml   1    0      0
reut-00007.xml   7    8      5
reut-00008.xml   4    1      1
reut-00009.xml   3    2      2
reut-00010.xml   5    2      1
reut-00011.xml   9    6      9

And here's a solution using the data added to the OP's question:

text <- c('saying text is good',
          'saying text once and saying text twice is better',
          'saying text text text is best',
          'saying text once is still ok',
          'not saying it at all is bad',
          'because text is a good thing',
          'we all like text',
          'even though sometimes it is missing')

validationText <- c("This has different words in it.",
                    "But I still want to count",
                    "the occurence of text",
                    "for example")

TextCorpus <- Corpus(VectorSource(text))
ValiTextCorpus <- Corpus(VectorSource(validationText))

Control = list(stopwords=TRUE, removePunctuation=TRUE, removeNumbers=TRUE, MinDocFrequency=5)

TextDTM = DocumentTermMatrix(TextCorpus, Control)
ValiTextDTM = DocumentTermMatrix(ValiTextCorpus, Control)

# find high frequency terms in TextDTM
(TextDTM.hifreq <- findFreqTerms(TextDTM, 5))
[1]   "saying"    "text"     

# find out how many times each high freq word occurs in TextDTM
TextDTM.mat <- as.matrix(TextDTM)
colSums(TextDTM.mat[,TextDTM.hifreq])
saying   text 
6        9

Here are the key lines: subset the second DTM using the list of high-frequency words from the first DTM. In this case I've used the intersect function, since the vector of high-frequency words includes a word that is not in the second corpus at all (and intersect seems to handle that better than %in%).

# now look into second DTM
ValiTextDTM.mat <- as.matrix(ValiTextDTM)
shared <- intersect(colnames(ValiTextDTM.mat), TextDTM.hifreq)
# drop = FALSE keeps the matrix shape (and the column name) when only one term is shared
common <- data.frame(ValiTextDTM.mat[, shared, drop = FALSE])
common
     text
1    0
2    0
3    1
4    0
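
Note that intersecting drops any high-frequency term that never occurs in the second corpus ("saying" here), so the two matrices do not end up with identical columns. If identical columns are needed, one option is to pad the missing terms with zeros. The following is only a base R sketch: `align_terms` and the toy matrix `vali` are hypothetical names made up for illustration and are not part of tm.

```r
# Hypothetical helper: return a matrix with exactly the columns in `terms`,
# in that order. Terms missing from `mat` become all-zero columns and
# columns of `mat` that are not in `terms` are dropped.
align_terms <- function(mat, terms) {
  out <- matrix(0, nrow = nrow(mat), ncol = length(terms),
                dimnames = list(rownames(mat), terms))
  shared <- intersect(colnames(mat), terms)
  out[, shared] <- mat[, shared, drop = FALSE]
  out
}

# Toy stand-in for as.matrix(ValiTextDTM): 4 documents, 2 terms
vali <- matrix(c(0, 0, 1, 0,    # counts for "text"
                 1, 0, 0, 0),   # counts for "example"
               nrow = 4,
               dimnames = list(1:4, c("text", "example")))

aligned <- align_terms(vali, c("saying", "text"))
colSums(aligned)
```

The result has one column per requested term, so it can be lined up column by column with the matrix built from the first corpus.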

How to find the total count of the high freq word(s) in the second corpus:

colSums(common)
text 
   1
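
A possibly more direct route to "exactly the same terms, no extra and no fewer" is the `dictionary` entry of tm's control list, which restricts the matrix to a given character vector of terms. This is a minimal sketch, assuming a tm version that provides `VCorpus` and supports `dictionary` in `DocumentTermMatrix`; the names `prep`, `keep` and `TrainDTM` are made up for illustration.

```r
library(tm)

text <- c('saying text is good',
          'saying text once and saying text twice is better',
          'saying text text text is best',
          'saying text once is still ok',
          'not saying it at all is bad',
          'because text is a good thing',
          'we all like text',
          'even though sometimes it is missing')

validationText <- c("This has different words in it.",
                    "But I still want to count",
                    "the occurence of text",
                    "for example")

TextCorpus     <- VCorpus(VectorSource(text))
ValiTextCorpus <- VCorpus(VectorSource(validationText))

# shared preprocessing options (hypothetical name)
prep <- list(stopwords = TRUE, removePunctuation = TRUE, removeNumbers = TRUE)

# terms that are frequent in the first corpus
TextDTM <- DocumentTermMatrix(TextCorpus, control = prep)
keep    <- findFreqTerms(TextDTM, lowfreq = 5)

# rebuild both matrices against the same dictionary, so the columns match
TrainDTM    <- DocumentTermMatrix(TextCorpus,
                                  control = c(prep, list(dictionary = keep)))
ValiTextDTM <- DocumentTermMatrix(ValiTextCorpus,
                                  control = c(prep, list(dictionary = keep)))

Terms(TrainDTM)
Terms(ValiTextDTM)
```

With this approach a term that never occurs in the validation corpus should still get its own (all-zero) column, so no separate subsetting step is needed.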
Ben
  • How does this answer the question? (PS: I won't be the one downvoting, should anybody use my comment as a reason to downvote.) Does the second corpus have the same terms as the first one? – agstudy Mar 23 '13 at 17:23
  • Cheers Ben, I managed to get something out of this. Your exact code however is not really that helpful because your DTMs are actually TDMs :P – N. McA. Mar 23 '13 at 17:38
  • I'll accept this because it led me to the right answer, but it'd be worth highlighting the last line because that's the key bit. Thanks – N. McA. Mar 23 '13 at 18:58
  • @N.McA. It would be better to upvote this answer, post your own solution as a new answer, and accept that one. – agstudy Mar 23 '13 at 22:14
  • Thanks for the comments, I've edited my answer to make it better match the question. – Ben Mar 24 '13 at 13:02