I am trying to use the R tm package to solve a string comparison problem (single-word strings, not multi-word text). I have already used the Levenshtein distance, which gives me meaningful results, but I am not fully satisfied. I am now trying cosine similarity after reading an article that I found interesting.
I have studied the documentation and read some articles, but at this point I think I have not understood what the algorithm can do.
I am able to use it when the terms are words, e.g.
docs <- c("open letters", "closed letters", "letters")
terms <- c("open", "closed", "letters")
But I am not able to ask the system to treat every single letter as a term, e.g.
terms <- c("a", "b", "c", "d")
That would give me a string comparison based on the term-document matrix. But maybe that is already where my mistake lies.
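To make concrete what I mean by treating every letter as a term, here is a minimal base R sketch that does not use tm at all. The helper name `char_counts` is my own invention for illustration, not a tm function; it only counts how often each letter occurs in each string.

```r
# Hypothetical helper (not part of tm): letter counts per string.
char_counts <- function(strings) {
  # Split each lowercased string into single characters and drop whitespace.
  chars <- lapply(strsplit(tolower(strings), ""),
                  function(ch) ch[!grepl("\\s", ch)])
  vocab <- sort(unique(unlist(chars)))
  # One row per string, one column per letter.
  counts <- matrix(0L, nrow = length(strings), ncol = length(vocab),
                   dimnames = list(strings, vocab))
  for (i in seq_along(chars)) {
    counts[i, ] <- as.integer(table(factor(chars[[i]], levels = vocab)))
  }
  counts
}

m <- char_counts(c("door", "doo", "oor"))
print(m)
```

Each row of `m` is then the kind of per-letter vector I would like tm to build for me.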
How would one implement a single-word string comparison in tm?
Thank you for your help. P.S. I have not posted code because this is a more general question, but I can create an example if needed.
Nicola
Here is the working code as per suggestion:
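library(tm)
doc <- c("closed door", "Open door", "door", "doo", "oor", "house")
# VCorpus is used because recent tm versions ignore custom tokenizers on a SimpleCorpus.
doc_corpus <- VCorpus(VectorSource(doc))
# character_tokenize is the per-character tokenizer from the suggestion; it is not defined in this post.
# wordLengths = c(1, Inf) keeps one-character terms; the tm default drops terms shorter than 3 characters.
control_list <- list(removePunctuation = TRUE, tolower = TRUE,
                     tokenize = character_tokenize, wordLengths = c(1, Inf))
tdm <- DocumentTermMatrix(doc_corpus, control = control_list)
tf <- as.matrix(tdm)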
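From a count matrix like `tf` (one row per string, one column per term), the cosine similarity between rows could then be computed roughly as follows. This is only a sketch: `cosine_sim` is a hypothetical helper name of my own, not something tm provides, and the small matrix is typed in by hand so that the snippet runs without tm.

```r
# Hypothetical helper (not part of tm): pairwise cosine similarity between matrix rows.
cosine_sim <- function(m) {
  norms <- sqrt(rowSums(m^2))       # Euclidean length of each row vector
  (m %*% t(m)) / (norms %o% norms)  # dot products divided by the product of lengths
}

# Hand-typed letter counts for "door", "doo", "oor" (columns: d, o, r).
counts <- rbind(door = c(d = 1, o = 2, r = 1),
                doo  = c(d = 1, o = 2, r = 0),
                oor  = c(d = 0, o = 2, r = 1))
sims <- cosine_sim(counts)
print(round(sims, 3))
```

With these counts, "doo" and "oor" get a similarity of 0.8. Note that letter counts ignore letter order, so anagrams would score 1, and a row of all zeros would produce NaN because its length is zero.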