I have created a TermDocumentMatrix that looks something like this:

>inspect(tdm[1:6,1:3])   
Terms       Doc1.txt   Doc2.txt    Doc3.txt
abcd          1           0          0
abandon       0           1          1
qrd           0           0          1
abductor      1           0          0 
plo           1           1          0
man           0           1          0 

I also have a list of words something like:

>dict
abductor
abandon
man
mammoth

Now how do I subset the TermDocumentMatrix rows so that it looks like

Terms       Doc1.txt   Doc2.txt    Doc3.txt
abandon       0           1          1
abductor      1           0          0 
man           0           1          0

I can check which row names of the matrix appear in the 'dict' list, but I can't work out how to subset the matrix by them.

anonymous

1 Answer

You can subset the rows by name with a character vector of words. You didn't include a reproducible example, so I'll just use the one from the ?TermDocumentMatrix help page.

library(tm)
data("crude")
tdm <- TermDocumentMatrix(crude,
    control = list(removePunctuation = TRUE,
    stopwords = TRUE))

words <- c("world", "zero")
inspect(tdm[words, 1:3])

# <<TermDocumentMatrix (terms: 2, documents: 3)>>
# Non-/sparse entries: 1/5
# Sparsity           : 83%
# Maximal term length: 5
# Weighting          : term frequency (tf)
# 
#        Docs
# Terms   127 144 191
#   world   0   1   0
#   zero    0   0   0

If you don't know which of the words appear in the matrix, you can first filter the vector down to the terms that are actually present:

words <- c("world","zero", "xyyzy")
inspect(tdm[words[words %in% Terms(tdm)], 1:3])
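
As a rough illustration of the same idea on the data from the question, here is a minimal sketch that uses a plain base-R matrix as a stand-in for as.matrix(tdm). The names m, keep, sub and missing_words are made up for this example, and intersect() is swapped in for the %in% filter; with a real TermDocumentMatrix you would presumably use Terms(tdm) where this sketch uses rownames(m).

```r
# Toy stand-in for as.matrix(tdm), filled with the values from the question
m <- matrix(
  c(1, 0, 0,
    0, 1, 1,
    0, 0, 1,
    1, 0, 0,
    1, 1, 0,
    0, 1, 0),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    Terms = c("abcd", "abandon", "qrd", "abductor", "plo", "man"),
    Docs  = c("Doc1.txt", "Doc2.txt", "Doc3.txt")
  )
)

dict <- c("abductor", "abandon", "man", "mammoth")

# Keep only the dictionary words that are actual row names;
# the result follows the order of dict
keep <- intersect(dict, rownames(m))

# drop = FALSE keeps the result a matrix even if only one word matches
sub <- m[keep, , drop = FALSE]

# Dictionary words that have no matching row
missing_words <- setdiff(dict, rownames(m))

print(sub)
print(missing_words)
```

If you would rather keep the rows in the matrix's own order instead of the order of dict, filtering the row names (rownames(m)[rownames(m) %in% dict]) should do that.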
MrFlick
  • If I try words <- c('world','zero','xyyzy') in your code snippet, I get this error: Error in `[.simple_triplet_matrix`(tdm, words, 1:3) : Subscript out of bounds. I want to check the words in the rows of the matrix against a given list – anonymous Jul 28 '15 at 22:11
  • That was not very clear from your question. I've updated the answer to subset the word list to only those words in the matrix. – MrFlick Jul 28 '15 at 22:22
  • Apologies, but the modified answer works great, thanks – anonymous Jul 28 '15 at 22:28