
I am trying to use the R tm package to solve a string-comparison problem (single-word strings, not multi-word text). I have already used the Levenshtein distance, which gave me a meaningful result in these terms, but I am not fully satisfied. I am now trying cosine similarity, after reading an article I found interesting.

I have studied the documentation and read some articles, but at this point I do not think I have fully understood the algorithm's capabilities.

I am able to use it when the terms are words.

e.g.

docs <- c("open letters", "closed letters", "letters")
terms <- c("open", "closed", "letters")

But I am not able to ask the system to treat every single letter as a term, e.g. c("a", "b", "c", "d").

That would let me do a string comparison using the term-document matrix. But maybe that is already where my mistake is.

What would it take to implement a single-word string comparison in tm?

Thank you for your help. P.S. I have not posted code because this is a more general question, but I can create an example if needed.

Nicola

Here is the working code as per the suggestion (the tokenizer has to be passed through the `control` list, and `wordLengths` lowered so single-letter terms are kept):

library(tm)
# tokenize each document into single characters, as suggested in the comments
character_tokenize <- function(x) strsplit(as.character(x), split = "")[[1]]
doc <- c("closed door", "Open door", "door", "doo", "oor", "house")
doc_corpus <- Corpus(VectorSource(doc))
control_list <- list(removePunctuation = TRUE, tolower = TRUE,
                     tokenize = character_tokenize, wordLengths = c(1, Inf))
tdm <- DocumentTermMatrix(doc_corpus, control = control_list)
tf <- as.matrix(tdm)
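Once a character-level term-frequency matrix exists, the cosine similarity itself is only a few lines. Here is a minimal base-R sketch of the same idea, independent of tm (the helper names `char_tf` and `cosine_sim` are mine, for illustration):

```r
# Count the characters of a lowercased string as a term-frequency table.
char_tf <- function(s) table(strsplit(tolower(s), split = "")[[1]])

# Cosine similarity of two strings over their combined character alphabet.
cosine_sim <- function(a, b) {
  ta <- char_tf(a); tb <- char_tf(b)
  terms <- union(names(ta), names(tb))
  va <- as.numeric(ta[terms]); va[is.na(va)] <- 0  # absent characters count 0
  vb <- as.numeric(tb[terms]); vb[is.na(vb)] <- 0
  sum(va * vb) / (sqrt(sum(va^2)) * sqrt(sum(vb^2)))
}

cosine_sim("door", "oor")    # high, because most letters are shared
cosine_sim("door", "house")  # lower, only "o" in common
```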
  • Please share some sample data. This is also unclear **"But I am not able to ask to the system to treat every single letter c ("a", "b", "c", "d")".** Do you want to do stemming? – NelsonGon Mar 15 '19 at 14:38
  • It is not clear what you want. The cosine similarity compares texts by determining how similar the vocabulary is. If you want something more fine-grained (like letters), use the Levenshtein distance. Here is a simple example of cosine similarity: https://stackoverflow.com/a/1750187/5028841 – JBGruber Mar 15 '19 at 14:43
  • Below, in a separate comment, I have put a clarification of my intent – Nicola Mar 15 '19 at 15:28
  • 1
    What you need to do is tokenize the text to single characters instead of words (the more standard token). Something like `character_tokenize <- function(x) strsplit(x, split = "")` as your tokenization function – emilliman5 Mar 15 '19 at 16:28
  • Thank you, emilliman5 worked quite well. Here is the final code: library(tm) doc <- c( "closed door", "Open door", "door", "doo", "oor", "house" ) doc_corpus <- Corpus( VectorSource(doc) ) control_list <- list(removePunctuation = TRUE, tolower = TRUE) tdm <- DocumentTermMatrix(doc_corpus, control = character_tokenize(doc)) tf <- as.matrix(tdm) – Nicola Mar 18 '19 at 11:00

1 Answer


This is what I have understood I can do. Given a document, in my case the "doc" vector of strings, the system provides the TDM, where a term scores 1 only on a full word match (e.g. "closed" matches "closed door"), but "door" will not match "oor".

Example:

library(tm)
doc <- c( "closed door", "Open door", "door", "doo", "oor", "house" )
doc_corpus <- Corpus( VectorSource(doc) )
control_list <- list(removePunctuation = TRUE, tolower = TRUE)
tdm <- TermDocumentMatrix(doc_corpus, control = control_list)
tf <- as.matrix(tdm)

(screenshot: the resulting term-document matrix, one row per whole word)

The point is that I have read that I could also do something like this, where the terms are single letters, and I would like to confirm whether this is possible:

(screenshot: a term-document matrix where the terms are single letters)

so as to have a TDM from which to compute the cosine distance between two strings. But I could not find anything about this in the documentation.
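Following the suggestion in the comments, here is a sketch of how the character-level TDM plus cosine distance could be wired together in tm, assuming tm's `tokenize` and `wordLengths` control options (the `char_tokenize` helper and the plain `cosine` function are my own illustrative names):

```r
library(tm)

# Illustrative character-level tokenizer: drop spaces, split into letters.
char_tokenize <- function(x) strsplit(gsub(" ", "", as.character(x)), split = "")[[1]]

doc <- c("closed door", "Open door", "door", "doo", "oor", "house")
doc_corpus <- Corpus(VectorSource(doc))
tdm <- TermDocumentMatrix(doc_corpus,
                          control = list(tolower = TRUE,
                                         tokenize = char_tokenize,
                                         wordLengths = c(1, Inf)))  # keep 1-letter terms
m <- as.matrix(tdm)

# Cosine similarity between two document columns of the TDM.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cosine(m[, 3], m[, 5])  # "door" vs "oor": high, most letters shared
```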

Thank you for your help, Nicola
