I am unable to use `tm_combine` in R. Here are the version details:

platform       x86_64-w64-mingw32          
arch           x86_64                      
os             mingw32                     
system         x86_64, mingw32             
status                                     
major          3                           
minor          3.3                         
year           2017                        
month          03                          
day            06                          
svn rev        72310                       
language       R                           
version.string R version 3.3.3 (2017-03-06)
nickname       Another Canoe  

I would like to understand why. If the function is simply not available, how do I combine two document-term matrices, D1 and D2, that have different numbers of columns?

> packageVersion("tm")
[1] ‘0.7.1’
> dim(s.tdm)
[1] 132 536
> dim(f.tdm)
[1] 132 674
> 

Thanks.

Here's the code I was trying:

library(tm)
library(SnowballC)

s.dir <- "AuthorIdentify\\Author1"
f.dir <- "AuthorIdentify\\Author2"

s.docs <- Corpus(DirSource(s.dir, encoding="UTF-8"))
f.docs <- Corpus(DirSource(f.dir, encoding="UTF-8"))

cleanCorpus<-function(corpus){
  # apply stemming
  corpus <-tm_map(corpus, stemDocument)

  # remove punctuation
  corpus.tmp <- tm_map(corpus,removePunctuation)

  # remove white spaces
  corpus.tmp <- tm_map(corpus.tmp,stripWhitespace)

  # remove stop words
  corpus.tmp <-
    tm_map(corpus.tmp,removeWords,stopwords("en"))

  return(corpus.tmp)
}

s.cldocs <- cleanCorpus(s.docs) # preprocessing

# forms document-term matrix
s.tdm <- DocumentTermMatrix(s.cldocs)

# removes infrequent terms
s.tdm <- removeSparseTerms(s.tdm,0.97)

dim(s.tdm) # [ #docs, #numterms ]

f.cldocs <- cleanCorpus(f.docs) # preprocessing

# forms document-term matrix
f.tdm <- DocumentTermMatrix(f.cldocs)

# removes infrequent terms
f.tdm <- removeSparseTerms(f.tdm,0.97)

dim(f.tdm) # [ #docs, #numterms ]


# how do I combine f.tdm and s.tdm?
# tm_combine???

I need to combine them (and eventually convert the result to a matrix or data.frame) so that I can add a column identifying each row as Author1 or Author2.

With the approach referenced in the linked article, the output of the combined DTMs does not match the expected output. I have referenced the relevant details in the comments section.
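
To make the intended result concrete, here is a minimal sketch in base R on toy matrices rather than on the real data. It assumes each document-term matrix has first been converted with `as.matrix()`. The helper `combine_counts()` and the toy matrices `s.mat` and `f.mat` are hypothetical names introduced only for illustration; they are not part of `tm`. The idea is to take the union of the terms, fill terms missing from one matrix with zeros, stack the rows, and then add an author column.

```r
# Hypothetical helper (not part of tm): row-bind two count matrices whose
# columns (terms) differ, filling terms absent from one matrix with 0.
combine_counts <- function(a, b) {
  terms <- union(colnames(a), colnames(b))
  pad <- function(m) {
    out <- matrix(0, nrow = nrow(m), ncol = length(terms),
                  dimnames = list(rownames(m), terms))
    out[, colnames(m)] <- m   # copy the counts for the terms this matrix has
    out
  }
  rbind(pad(a), pad(b))
}

# Toy stand-ins for as.matrix(s.tdm) and as.matrix(f.tdm)
s.mat <- matrix(c(1, 0, 2, 3), nrow = 2,
                dimnames = list(c("s1", "s2"), c("apple", "pear")))
f.mat <- matrix(c(4, 5, 0, 1, 2, 2), nrow = 2,
                dimnames = list(c("f1", "f2"), c("pear", "plum", "fig")))

combined <- combine_counts(s.mat, f.mat)

# One author label per document row, then a data.frame for later modelling
df <- data.frame(author = rep(c("Author1", "Author2"),
                              times = c(nrow(s.mat), nrow(f.mat))),
                 combined, check.names = FALSE)
```

In this sketch the combined matrix has one column per distinct term, so a term shared by both inputs is counted once and the column count can be smaller than the sum of the two inputs' column counts.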

user3701522
  • You really need to provide a minimal reproducible example (see this post for help: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) Without a specific error message or even some data we cannot help you. – Phil May 23 '17 at 08:37
  • When I say ?tm_combine in RStudio command area, I do not get this function name in the drop down. My first level question really is about anything that I might be missing since I have the right version of RStudio and tm package – user3701522 May 23 '17 at 08:40
  • Added my code just in case it helps get a better context. – user3701522 May 23 '17 at 08:43
  • I've just checked the latest CRAN version of `tm` and there's no `tm_combine()` function. This answer might help: https://stackoverflow.com/a/25535295/3022126 – Phil May 23 '17 at 08:47
  • Possible duplicate of [Use tm's Corpus function with big data in R](https://stackoverflow.com/questions/25533594/use-tms-corpus-function-with-big-data-in-r) – Phil May 23 '17 at 08:50
  • Flagged as duplicate of https://stackoverflow.com/questions/25533594/use-tms-corpus-function-with-big-data-in-r I think the initial problems were different, but solutions look the same. – Phil May 23 '17 at 08:51
  • Thanks Phil. But the output of dim after merging is not matching the expected output. Refer https://drive.google.com/file/d/0BzqeP3J9B8lZWjJIRk1JazByT00/edit – user3701522 May 23 '17 at 08:54
  • With the approach from the referenced article, `dim(c(s.tdm, f.tdm))` gives `[1] 264 918`, whereas it should be 264, 518. – user3701522 May 23 '17 at 08:55
  • You are working with R, Rstudio is only an IDE. When you do an internet search, consider keeping this in mind. – Roman Luštrik May 23 '17 at 08:55
  • In that case can you provide the minimal reproducible example? Read the answers to the post to provide minimal data set and *minimal* code. There's very little we can do without some example data – Phil May 23 '17 at 09:17

0 Answers