How do I set up TF weight of terms in corpus using the ‘tm’ package in R

Question

I wonder how I can get the term frequency weight in the tm package, defined as tf = (count of the term in the document) / (total number of terms in the document).

MyMatrix <- DocumentTermMatrix(a, control = list(weight= weightTf))

After I use this weighting, it shows the raw frequency of each term rather than the TF weight, like this:

Doc(1)  1   0   0   3   0   0   2
Doc(2)  0   0   0   0   0   0   0
Doc(3)  0   5   0   0   0   0   1
Doc(4)  0   0   0   2   2   0   0
Doc(5)  0   4   0   0   0   0   1
Doc(6)  5   0   0   0   1   0   0
Doc(7)  0   5   0   0   0   0   0
Doc(8)  0   0   0   1   0   0   7

I know it is not the tm package, but I like to use the tidytext package; `bind_tf_idf` is the function you could use. The following blog entry from the package author gives a nice overview: http://juliasilge.com/blog/Life-Changing-Magic/ – PhiSeu, Sep 12 '16 at 10:38
Possible duplicate of [Trying to get tf-idf weighting working in R](http://stackoverflow.com/questions/14820590/trying-to-get-tf-idf-weighting-working-in-r) – Hack-R, Sep 12 '16 at 10:43

score 1 · Accepted Answer · answered Sep 12 '16 at 11:10

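For example, you can define a custom weighting function that divides each count by the total number of terms in its document: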

library(tm)
corp <- Corpus(VectorSource(c(doc1="hello world", doc2="hello new world")))
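myfun <- WeightFunction(function(m) {
  # total term count per document (the output below shows each document's weights summing to 1)
  cs <- slam::col_sums(m)
  # divide every stored (non-zero) count by the total of the document it belongs to
  m$v <- m$v / cs[m$j]
  return(m)
}, "Term Frequency by Total Document Term Frequency", "termbytot")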
dtm <- DocumentTermMatrix(corp, control = list(weighting = myfun))
inspect(dtm)
# <<DocumentTermMatrix (documents: 2, terms: 3)>>
# Non-/sparse entries: 5/1
# Sparsity           : 17%
# Maximal term length: 5
# 
#     Terms
# Docs     hello       new     world
#    1 0.5000000 0.0000000 0.5000000
#    2 0.3333333 0.3333333 0.3333333
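
If the DocumentTermMatrix has already been built with raw counts, the same per-document normalization can be sketched directly on its sparse representation. This is only an illustration, assuming the slam package is installed and that the matrix holds plain counts; the names `totals` and `tf` are made up for the example.

```r
library(tm)
library(slam)

corp <- Corpus(VectorSource(c(doc1 = "hello world", doc2 = "hello new world")))
dtm  <- DocumentTermMatrix(corp)        # default weighting: raw counts

# Total number of terms in each document (rows of a DTM are documents)
totals <- unname(slam::row_sums(dtm))

# dtm$i is the row (document) index of each stored non-zero count,
# so only non-zero entries are touched and the matrix stays sparse
tf   <- dtm
tf$v <- dtm$v / totals[dtm$i]

inspect(tf)
```

Because an all-zero document has no stored entries, nothing is divided by zero for it. Note that the weighting label attached to `tf` still reads as plain term frequency, since only the values were changed.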

score 0 · Answer 2 · Sandipan Dey · 2016-09-12T11:27:10.017

Something like m / rowSums(m), with m <- as.matrix(MyMatrix), should give you the desired result (the conversion is needed because rowSums expects an ordinary matrix).

But if a document has no terms (its row in the DTM is all zeros, as in your case), the division is 0/0 and that row comes out as NaN, for example:

Doc(1) 0.1111111   0   0 0.5555556 0.1111111 0.2222222 0.0000000
Doc(2) 0.0000000   1   0 0.0000000 0.0000000 0.0000000 0.0000000
Doc(3)       NaN NaN NaN       NaN       NaN       NaN       NaN
Doc(4) 1.0000000   0   0 0.0000000 0.0000000 0.0000000 0.0000000
Doc(5) 0.0000000   0   0 0.0000000 0.2857143 0.5714286 0.1428571
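
The NaN row is easy to reproduce in base R. The small count matrix below is hypothetical (it is not the data from the question) and only serves to show where the 0/0 comes from.

```r
# Hypothetical document-term counts; Doc(2) contains no terms at all
counts <- matrix(c(1, 0, 3,
                   0, 0, 0,
                   0, 5, 1),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("Doc(", 1:3, ")"), c("t1", "t2", "t3")))

# The vector of row totals is recycled down the columns,
# so every entry is divided by the total of its own row
tf <- counts / rowSums(counts)

print(tf)
print(is.nan(tf["Doc(2)", ]))   # all TRUE: 0 / 0 for the empty document
```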

So, a better approach is:

t(apply(MyMatrix, 1, function(x) if (sum(x) != 0) x / sum(x) else x))

with the desired result:

Doc(1) 0.1111111  0  0 0.5555556 0.1111111 0.2222222 0.0000000
Doc(2) 0.0000000  1  0 0.0000000 0.0000000 0.0000000 0.0000000
Doc(3) 0.0000000  0  0 0.0000000 0.0000000 0.0000000 0.0000000
Doc(4) 1.0000000  0  0 0.0000000 0.0000000 0.0000000 0.0000000
Doc(5) 0.0000000  0  0 0.0000000 0.2857143 0.5714286 0.1428571
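
A vectorized variant of the same guard, sketched on a hypothetical dense count matrix, replaces zero row totals with 1 before dividing, so empty documents stay all zero. The comparison at the end checks that it agrees with the apply-based version.

```r
# Hypothetical document-term counts; Doc(2) contains no terms at all
counts <- matrix(c(1, 0, 3,
                   0, 0, 0,
                   0, 5, 1),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("Doc(", 1:3, ")"), c("t1", "t2", "t3")))

totals <- rowSums(counts)
totals[totals == 0] <- 1          # 0 / 1 = 0, so empty rows remain zero
tf_vec <- counts / totals

# The apply-based guard, applied to the same matrix
tf_apply <- t(apply(counts, 1, function(x) if (sum(x) != 0) x / sum(x) else x))

print(all.equal(tf_vec, tf_apply))
```

Both versions work on a dense matrix, so they use as much memory as as.matrix() does on the full DTM.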

edited Sep 12 '16 at 11:27

answered Sep 12 '16 at 10:45

Please provide example code and explain how this would help – Sean Reddy Sep 12 '16 at 10:52
This approach causes the error "cannot allocate vector of size 489 Kb" – user3655888 Sep 12 '16 at 12:32
I guess as.matrix(myMatrix) is having the memory issue: please refer to http://stackoverflow.com/questions/6860715/converting-a-document-term-matrix-into-a-matrix-with-lots-of-data-causes-overflo and use myMatrix=as.big.matrix(x=as.matrix(myMatrix)). – Sandipan Dey Sep 13 '16 at 05:16