How can I get the term frequency (TF) weight in the tm package, where tf = (count of the term in the document) / (total number of terms in the document)?

MyMatrix <- DocumentTermMatrix(a, control = list(weight= weightTf))

With this weighting, the matrix shows the raw count of each term rather than the TF weight, like this:

Doc(1)  1   0   0   3   0   0   2
Doc(2)  0   0   0   0   0   0   0
Doc(3)  0   5   0   0   0   0   1
Doc(4)  0   0   0   2   2   0   0
Doc(5)  0   4   0   0   0   0   1
Doc(6)  5   0   0   0   1   0   0
Doc(7)  0   5   0   0   0   0   0
Doc(8)  0   0   0   1   0   0   7
TylerH
  • I know it is not the tm package, but I like to use the tidytext package; `bind_tf_idf` is the function you could use. The following blog entry from the author gives a nice overview of the package: http://juliasilge.com/blog/Life-Changing-Magic/ – PhiSeu Sep 12 '16 at 10:38
  • You use the option `weighting`, not `weight`. – Hack-R Sep 12 '16 at 10:42
  • Possible duplicate of [Trying to get tf-idf weighting working in R](http://stackoverflow.com/questions/14820590/trying-to-get-tf-idf-weighting-working-in-r) – Hack-R Sep 12 '16 at 10:43

2 Answers

For example, you can define a custom weighting function and pass it through the `weighting` control option:

library(tm)
corp <- Corpus(VectorSource(c(doc1="hello world", doc2="hello new world")))
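# Custom weighting: divide each count by its document's total term count.
myfun <- WeightFunction(function(m) {
  cs <- slam::col_sums(m)   # total term count per document (documents are the columns of m here)
  m$v <- m$v/cs[m$j]        # scale each stored count by its document's total
  return(m)
}, "Term Frequency by Total Document Term Frequency", "termbytot")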
dtm <- DocumentTermMatrix(corp, control = list(weighting = myfun))
inspect(dtm)
# <<DocumentTermMatrix (documents: 2, terms: 3)>>
# Non-/sparse entries: 5/1
# Sparsity           : 17%
# Maximal term length: 5
# 
#     Terms
# Docs     hello       new     world
#    1 0.5000000 0.0000000 0.5000000
#    2 0.3333333 0.3333333 0.3333333
lukeA

Something like MyMatrix / rowSums(MyMatrix) should give you the desired result (convert the DTM to a regular matrix with as.matrix(MyMatrix) first if needed).

However, if a document has no terms (its row in the DTM is all zeros, as happens in your data), the division produces a row of NaNs, for example:

Doc(1) 0.1111111   0   0 0.5555556 0.1111111 0.2222222 0.0000000
Doc(2) 0.0000000   1   0 0.0000000 0.0000000 0.0000000 0.0000000
Doc(3)       NaN NaN NaN       NaN       NaN       NaN       NaN
Doc(4) 1.0000000   0   0 0.0000000 0.0000000 0.0000000 0.0000000
Doc(5) 0.0000000   0   0 0.0000000 0.2857143 0.5714286 0.1428571

So, a better approach is:

t(apply(MyMatrix, 1, function(x) if(sum(x) != 0) x / sum(x) else x))

with the desired result:

Doc(1) 0.1111111  0  0 0.5555556 0.1111111 0.2222222 0.0000000
Doc(2) 0.0000000  1  0 0.0000000 0.0000000 0.0000000 0.0000000
Doc(3) 0.0000000  0  0 0.0000000 0.0000000 0.0000000 0.0000000
Doc(4) 1.0000000  0  0 0.0000000 0.0000000 0.0000000 0.0000000
Doc(5) 0.0000000  0  0 0.0000000 0.2857143 0.5714286 0.1428571
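
The same guard against empty documents can also be written without apply. The following is only a minimal sketch in base R, assuming the counts are already held in an ordinary dense matrix; the `counts` data and the name `safe_totals` are made up for illustration and are not part of the question.

```r
# Hypothetical toy counts: rows are documents, columns are terms.
counts <- matrix(c(1, 0, 3,
                   0, 0, 0,
                   0, 5, 1),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("Doc", 1:3), c("a", "b", "c")))

totals <- rowSums(counts)

# Replace zero totals by 1 so empty documents divide to 0 instead of NaN.
safe_totals <- ifelse(totals == 0, 1, totals)

# Recycling runs down the columns, so every entry in row i
# is divided by safe_totals[i].
tf <- counts / safe_totals
print(tf)
```

Non-empty rows of `tf` sum to 1, and the empty document stays all zeros.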
Sandipan Dey
  • Please provide example code and explain how this would help – Sean Reddy Sep 12 '16 at 10:52
  • This approach causes the error `cannot allocate vector of size 489 Kb`. – user3655888 Sep 12 '16 at 12:32
  • I guess `as.matrix(MyMatrix)` is what runs into the memory issue. Please refer to http://stackoverflow.com/questions/6860715/converting-a-document-term-matrix-into-a-matrix-with-lots-of-data-causes-overflo and use `MyMatrix = as.big.matrix(x = as.matrix(MyMatrix))`. – Sandipan Dey Sep 13 '16 at 05:16
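
One way to sidestep the memory problem discussed in these comments may be to never build the dense matrix at all. The sketch below assumes the slam package is installed (the first answer already calls it) and uses a small hand-made triplet matrix as a stand-in for a real DocumentTermMatrix, which stores its counts in the same `i`/`j`/`v` layout with documents as rows. The data are invented for illustration.

```r
library(slam)

# Hypothetical sparse counts in triplet form (3 documents x 2 terms);
# document 2 is empty, so it has no stored entries at all.
m <- simple_triplet_matrix(i = c(1, 1, 3), j = c(1, 2, 2), v = c(1, 3, 5),
                           nrow = 3, ncol = 2)

totals <- row_sums(m)        # per-document totals, computed on the sparse form

tf <- m
tf$v <- m$v / totals[m$i]    # only stored (non-zero) entries are touched

print(as.matrix(tf))         # dense view, reasonable only for a tiny example
```

Because only stored entries are divided, an empty document never triggers a division by zero, and its row remains all zeros.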