3

I have been using R's text mining package (tm), and it's really a great tool. However, I have not found any retrieval support in it, or maybe there is functionality I am missing. How can a simple vector space model (VSM) be implemented using R's text mining package?

madhead
Shivani Rao

2 Answers

1
# Sample R commands in support of my previous answer
require(fortunes)
require(tm)

# Build a small collection from the first ten fortunes
sentences <- NULL
for (i in 1:10) sentences <- c(sentences, fortune(i)$quote)
d <- data.frame(textCol = sentences)
ds <- DataframeSource(d)
dsc <- Corpus(ds)

# Document-term matrix of raw term frequencies, with stopwords removed
dtm <- DocumentTermMatrix(dsc, control = list(weighting = weightTf, stopwords = TRUE))
dictC <- Dictionary(dtm)

# The query below is created from words in fortune(1) and fortune(2)
newQry <- data.frame(textCol = "lets stand up and be counted seems to work undocumented")
newQryC <- Corpus(DataframeSource(newQry))

# Map the query onto the collection's dictionary (dictC; the originally posted dict1 was undefined)
dtmNewQry <- DocumentTermMatrix(newQryC, control = list(weighting = weightTf, stopwords = TRUE, dictionary = dictC))
dictQry <- Dictionary(dtmNewQry)

# Below does a naive similarity (number of features in common)
apply(dtm, 1, function(x, y = dictQry) { length(intersect(names(x)[x != 0], y)) })
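The raw overlap count above tends to favour long documents, since they contain more terms to match. One possible refinement is to normalise the overlap by the size of the union of the two term sets (Jaccard similarity). The following is a minimal sketch in base R, not part of the original answer; the helper jaccard() and the toy documents are made up for illustration. With the objects from the code above, the two term sets would presumably be names(x)[x != 0] and dictQry.

```r
# Hypothetical helper: Jaccard similarity between two sets of terms.
jaccard <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  union_size <- length(union(a, b))
  if (union_size == 0) return(0)  # both sets empty
  length(intersect(a, b)) / union_size
}

# Toy documents and query (made up for illustration), already tokenized.
docs <- list(
  d1 = c("stand", "up", "and", "be", "counted"),
  d2 = c("seems", "to", "work", "undocumented"),
  d3 = c("statistics", "is", "hard")
)
query <- c("stand", "counted", "work")

# Naive similarity: number of terms in common with the query.
overlap <- sapply(docs, function(d) length(intersect(d, query)))

# Normalised similarity: shared terms divided by all distinct terms.
jacc <- sapply(docs, jaccard, b = query)

print(overlap)
print(jacc)
```

Both measures ignore how often a term occurs; they only look at whether it is present.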
harshsinghal
  • When the query contains features that are not in the dictionary created for the collection, those features are not mapped. From the termFreq documentation (/library/tm/html/termFreq.html): "dictionary: A character vector to be tabulated against. No other terms will be listed in the result. Terms from the dictionary not occurring in the document at all will be skipped for performance reasons. Defaults to no action (i.e., all terms are considered)." I had not thought of this earlier. – harshsinghal Nov 02 '10 at 16:16
  • The line dtmNewQry <- DocumentTermMatrix(newQryC, control = list(weighting=weightTf,stopwords=TRUE,dictionary=dict1)) as originally posted produces an error; please use dictionary=dictC instead. – Shreyas Karnik Nov 03 '10 at 21:29
0

Assuming VSM = Vector Space Model, you can build a simple retrieval system as follows:

  • Create a Document Term Matrix of your collection/corpus
  • Create a function for your similarity measure (Jaccard, Euclidean, etc.). There are packages available with these functions. RSiteSearch should help in finding them.
  • Convert your query to a Document Term Matrix (which will have 1 row and is mapped using the same dictionary as used for the first step)
  • Compute similarity with the query and the matrix from the first step.
  • Rank the results and choose the top n.
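The steps above can be sketched in base R as follows. This is only an illustration under stated assumptions, not the tm-based implementation: the toy corpus, the tokenize() and count_terms() helpers, and the choice of cosine similarity are all made up for the example. The point it is meant to show is step 3, where the query is counted against the same vocabulary as the collection so that both live in the same vector space.

```r
# Toy corpus and query (made up for illustration).
docs <- c(
  d1 = "the cat sat on the mat",
  d2 = "the dog chased the cat",
  d3 = "stock markets fell sharply"
)
query <- "cat on a mat"

# Hypothetical tokenizer: lowercase, split on non-letters, drop empty strings.
tokenize <- function(x) {
  tokens <- strsplit(tolower(x), "[^a-z]+")[[1]]
  tokens[nzchar(tokens)]
}

# Step 1: document-term matrix of raw term frequencies.
doc_tokens <- lapply(docs, tokenize)
vocab <- sort(unique(unlist(doc_tokens)))
count_terms <- function(tokens) {
  # Tokens outside the vocabulary become NA and are not counted.
  as.numeric(table(factor(tokens, levels = vocab)))
}
dtm <- t(sapply(doc_tokens, count_terms))
colnames(dtm) <- vocab

# Step 2: similarity measure (cosine similarity is assumed here).
cosine <- function(x, y) {
  denom <- sqrt(sum(x^2)) * sqrt(sum(y^2))
  if (denom == 0) return(0)
  sum(x * y) / denom
}

# Step 3: map the query onto the same vocabulary as the collection.
query_vec <- count_terms(tokenize(query))

# Step 4: similarity of the query to every document (one score per row).
scores <- apply(dtm, 1, cosine, y = query_vec)

# Step 5: rank the results and keep the top n.
top_n <- 2
ranked <- head(sort(scores, decreasing = TRUE), top_n)
print(ranked)
```

Raw term frequencies are used here to keep the sketch short; a TF-IDF weighting could be applied to dtm and query_vec before step 4 without changing the rest.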

A non-R method is to use a GIN index on a text column (rows are documents) of a table in PostgreSQL. Using PostgreSQL's tsvector-based full-text querying, you can have a very fast retrieval system.

harshsinghal
  • vegan::vegdist has a number of similarity indices not provided by stats package (at least I can't see them). – Roman Luštrik Nov 01 '10 at 18:56
  • I have a problem with implementing step 3 in R: I cannot find functions that would let me do that. I do know how VSM works, and what you have given here is a very broad, naive answer. Although I appreciate the answer, I need the R commands and libraries that would let me do the above, especially step 3. – Shivani Rao Nov 02 '10 at 04:00
  • Please check http://www.logic.at/staff/feinerer/publications/talks/237_GfKl_2006.pdf for more robust text clustering. – Shreyas Karnik Nov 04 '10 at 16:10