
I am trying to use the R tm package to solve a string-comparison problem (single-word strings, not multi-word text). I have already used the Levenshtein distance, which gave me a meaningful result in these terms, but I am not fully satisfied. I am now trying cosine similarity, after reading an article I found interesting.

I have studied the documentation and read some articles, but at this point I do not think I have fully understood the algorithm's capabilities.

I am able to use it when the terms are words.

e.g.

docs <- c("open letters", "closed letters", "letters")
terms <- c("open", "closed", "letters")

But I am not able to ask the system to treat every single letter as a term, e.g. c("a", "b", "c", "d").

That would let me do a string comparison using the term-document matrix. But maybe that is already where my mistake is.

What would it take to implement a single-word string comparison in tm?

Thank you for your help. P.S. I have not posted code because this is a more general question, but I can create an example if needed.

Nicola

Here is the working code as per the suggestion (the tokenizer has to be passed through the `control` list, and `wordLengths` lowered so single-letter terms are kept):

library(tm)
# tokenize each document into single characters, as suggested in the comments
character_tokenize <- function(x) strsplit(as.character(x), split = "")[[1]]
doc <- c("closed door", "Open door", "door", "doo", "oor", "house")
doc_corpus <- Corpus(VectorSource(doc))
control_list <- list(removePunctuation = TRUE, tolower = TRUE,
                     tokenize = character_tokenize, wordLengths = c(1, Inf))
tdm <- DocumentTermMatrix(doc_corpus, control = control_list)
tf <- as.matrix(tdm)
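Once a character-level term-frequency matrix exists, the cosine similarity itself is only a few lines. Here is a minimal base-R sketch of the same idea, independent of tm (the helper names `char_tf` and `cosine_sim` are mine, for illustration):

```r
# Count the characters of a lowercased string as a term-frequency table.
char_tf <- function(s) table(strsplit(tolower(s), split = "")[[1]])

# Cosine similarity of two strings over their combined character alphabet.
cosine_sim <- function(a, b) {
  ta <- char_tf(a); tb <- char_tf(b)
  terms <- union(names(ta), names(tb))
  va <- as.numeric(ta[terms]); va[is.na(va)] <- 0  # absent characters count 0
  vb <- as.numeric(tb[terms]); vb[is.na(vb)] <- 0
  sum(va * vb) / (sqrt(sum(va^2)) * sqrt(sum(vb^2)))
}

cosine_sim("door", "oor")    # high, because most letters are shared
cosine_sim("door", "house")  # lower, only "o" in common
```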
  • Please share some sample data. This is also unclear **"But I am not able to ask to the system to treat every single letter c ("a", "b", "c", "d")".** Do you want to do stemming? – NelsonGon Mar 15 '19 at 14:38
  • It is not clear what you want. The cosine similarity compares texts by determining how similar the vocabulary is. If you want something more fine-grained (like letters), use the Levenshtein distance. Here is a simple example of cosine similarity: https://stackoverflow.com/a/1750187/5028841 – JBGruber Mar 15 '19 at 14:43
  • Below, in a separate comment, I have put a clarification of my intent – Nicola Mar 15 '19 at 15:28
  • 1
    What you need to do is tokenize the text to single characters instead of words (the more standard token). Something like `character_tokenize <- function(x) strsplit(x, split = "")` as your tokenization function – emilliman5 Mar 15 '19 at 16:28
  • Thank you, emilliman5 worked quite well. Here is the final code: library(tm) doc <- c( "closed door", "Open door", "door", "doo", "oor", "house" ) doc_corpus <- Corpus( VectorSource(doc) ) control_list <- list(removePunctuation = TRUE, tolower = TRUE) tdm <- DocumentTermMatrix(doc_corpus, control = character_tokenize(doc)) tf <- as.matrix(tdm) – Nicola Mar 18 '19 at 11:00

1 Answer


This is what I have understood I can do. Given a document, in my case the "doc" vector of strings, the system provides the TDM, where a term scores 1 only on a full word match (e.g. "closed" matches "closed door"), but "door" will not match "oor".

Example:

library(tm)
doc <- c( "closed door", "Open door", "door", "doo", "oor", "house" )
doc_corpus <- Corpus( VectorSource(doc) )
control_list <- list(removePunctuation = TRUE, tolower = TRUE)
tdm <- TermDocumentMatrix(doc_corpus, control = control_list)
tf <- as.matrix(tdm)

(screenshot: the resulting term-document matrix, one row per whole word)

The point is that I have read that I could also do something like this, where the terms are single letters, and I would like to confirm whether this is possible:

(screenshot: a term-document matrix where the terms are single letters)

so as to have a TDM from which to compute the cosine distance between two strings. But I could not find anything about this in the documentation.
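Following the suggestion in the comments, here is a sketch of how the character-level TDM plus cosine distance could be wired together in tm, assuming tm's `tokenize` and `wordLengths` control options (the `char_tokenize` helper and the plain `cosine` function are my own illustrative names):

```r
library(tm)

# Illustrative character-level tokenizer: drop spaces, split into letters.
char_tokenize <- function(x) strsplit(gsub(" ", "", as.character(x)), split = "")[[1]]

doc <- c("closed door", "Open door", "door", "doo", "oor", "house")
doc_corpus <- Corpus(VectorSource(doc))
tdm <- TermDocumentMatrix(doc_corpus,
                          control = list(tolower = TRUE,
                                         tokenize = char_tokenize,
                                         wordLengths = c(1, Inf)))  # keep 1-letter terms
m <- as.matrix(tdm)

# Cosine similarity between two document columns of the TDM.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cosine(m[, 3], m[, 5])  # "door" vs "oor": high, most letters shared
```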

Thank you for your help, Nicola
