I am unable to use tm_combine in R. Here are the version details
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 3.3
year 2017
month 03
day 06
svn rev 72310
language R
version.string R version 3.3.3 (2017-03-06)
nickname Another Canoe
I would like to understand more on this. In case there is an issue with accessing this, my question is how do I combine two Document Term matrices D1 and D2 which have different number of columns?
> packageVersion("tm")
[1] ‘0.7.1’
> dim(s.tdm)
[1] 132 536
> dim(f.tdm)
[1] 132 674
>
Thanks.
Here's the code that I was trying
library(tm)
library(SnowballC)
s.dir <- "AuthorIdentify\\Author1"
f.dir <- "AuthorIdentify\\Author2"
s.docs <- Corpus(DirSource(s.dir, encoding="UTF-8"))
f.docs <- Corpus(DirSource(f.dir, encoding="UTF-8"))
cleanCorpus<-function(corpus){
# apply stemming
corpus <-tm_map(corpus, stemDocument)
# remove punctuation
corpus.tmp <- tm_map(corpus,removePunctuation)
# remove white spaces
corpus.tmp <- tm_map(corpus.tmp,stripWhitespace)
# remove stop words
corpus.tmp <-
tm_map(corpus.tmp,removeWords,stopwords("en"))
return(corpus.tmp)
}
s.cldocs <- cleanCorpus(s.docs) # preprocessing
# forms document-term matrix
s.tdm <- DocumentTermMatrix(s.cldocs)
# removes infrequent terms
s.tdm <- removeSparseTerms(s.tdm,0.97)
dim(s.tdm) # [ #docs, #numterms ]
f.cldocs <- cleanCorpus(f.docs) # preprocessing
# forms document-term matrix
f.tdm <- DocumentTermMatrix(f.cldocs)
# removes infrequent terms
f.tdm <- removeSparseTerms(f.tdm,0.97)
dim(f.tdm) # [ #docs, #numterms ]
#how do I combine f.tdm and s.tdm
tm_combine???
I need to combine them (and eventually to a matrix or data.frame) so that I can have a column identifier for Author1 or Author2
With the approach referenced in the linked article, the output of the combined DTMs does not match the expected output. I have referenced the relevant details in the comments section.