6

I want to compare two texts for similarity, so I need a simple function that lists, clearly and chronologically, the words and phrases occurring in both texts. These words/sentences should be highlighted or underlined for better visualization.

Based on @Joris Meys's idea, I added an array to divide the text into sentences and subordinate clauses.

This is how it looks:

textparts <- function(text){
  separators <- c("\\,", "\\.")
  i <- 1
  while(i <= length(separators)){
    text <- unlist(strsplit(text, separators[i]))
    i <- i + 1
  }
  return(text)
}
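For comparison, the same split can be written as a single strsplit() call with a character class (a sketch, not the original code). Note that the later pieces keep a leading space, which can make an exact comparison of pieces from two texts fail:

```r
# Sketch: single-call version of the split above; the character
# class [,.] matches either separator, so no loop is needed.
textpartsOneCall <- function(text) {
  unlist(strsplit(text, "[,.]"))
}

# The second and third pieces start with a space:
textpartsOneCall("This is a sentence, with a clause. Done.")
# [1] "This is a sentence" " with a clause"     " Done"
```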

textparts1 <- textparts("This is a complete sentence, whereas this is a dependent clause. This thing works.")
textparts2 <- textparts("This could be a sentence, whereas this is a dependent clause. Plagiarism is not cool. This thing works.")

  commonWords <- intersect(textparts1, textparts2)
  commonWords <- paste("\\<(",commonWords,")\\>",sep="")


  for(x in commonWords){
    textparts1 <- gsub(x, "\\1*", textparts1,ignore.case=TRUE)
    textparts2 <- gsub(x, "\\1*", textparts2,ignore.case=TRUE)
  }
  return(list(textparts1,textparts2))

However, sometimes it works, sometimes it doesn't.

I WOULD like to get results like these:

>   return(list(textparts1,textparts2))
[[1]]
[1] "This is a complete sentence"         " whereas this is a dependent clause*" " This thing works*"                  

[[2]]
[1] "This could be a sentence"            " whereas this is a dependent clause*" " Plagiarism is not cool"             " This thing works*"           

whereas I get no results.

digitalaxp
  • Have a look at the CRAN taskview for natural language processing for inspiration: http://cran.r-project.org/web/views/NaturalLanguageProcessing.html – Andrie May 26 '11 at 10:07
  • Thanks, I know about this site. There is a sentence detector, which might be useful for me, but there is no sentence-comparison highlighting of the kind I'm looking for – digitalaxp May 26 '11 at 10:28
  • hehe, now you make it confusing for the people. You can add comments if a solution is not what you're looking for, but you shouldn't change the content of your question. This is a completely different question. Second, you're not coding C, you're coding R. That while() construct is hideous. do `for(i in textparts)` and drop all the rest. And then you should take a look at what goes on. You might have spaces here and there, messing up your result. You might have differences uppercase-lowercase. Check the help files and check your intermediate results, and you'll solve it. – Joris Meys May 31 '11 at 07:51

2 Answers

7

There are some problems with @Chase's answer:

  • differences in capitalization are not taken into account
  • punctuation can mess up the results
  • if more than one word is shared, you get a lot of warnings from the gsub call.
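A minimal illustration of the last point (the strings here are made up): gsub() is not vectorised over its pattern argument, so a vector of common words triggers a warning and only the first word gets replaced:

```r
# With a length-2 pattern, gsub() warns and silently uses only
# the first element ("shot"); "a" is never replaced.
common <- c("shot", "a")
gsub(common, "X", "Dick Cheney shot a man")
# warns, then returns "Dick Cheney X a man"
```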

Based on his idea, here is a solution that makes use of tolower() and some nice regular-expression functionality:

compareSentences <- function(sentence1, sentence2) {
  # split everything on "not a word" and put all to lowercase
  x1 <- tolower(unlist(strsplit(sentence1, "\\W")))
  x2 <- tolower(unlist(strsplit(sentence2, "\\W")))

  commonWords <- intersect(x1, x2)
  #add word beginning and ending and put words between ()
  # to allow for match referencing in gsub
  commonWords <- paste("\\<(",commonWords,")\\>",sep="")


  for(x in commonWords){ 
    # replace the match by the match with star added
    sentence1 <- gsub(x, "\\1*", sentence1,ignore.case=TRUE)
    sentence2 <- gsub(x, "\\1*", sentence2,ignore.case=TRUE)
  }
  return(list(sentence1,sentence2))      
}

This gives the following result:

text1 <- "This is a test. Weather is fine"
text2 <- "This text is a test. This weather is fine. This blabalba This "

compareSentences(text1,text2)
[[1]]
[1] "This* is* a* test*. Weather* is* fine*"

[[2]]
[1] "This* text is* a* test*. This* weather* is* fine*. This* blabalba This* "
Joris Meys
  • Great, thanks. One more question: this is very useful for one or two sentences, but I would like to analyze texts with 10-15 sentences. In other words, it would be better to search for trigrams. – digitalaxp May 30 '11 at 13:46
  • 1
    @digitalaxp : you have all the building blocks. See `?regex` and `?strsplit` for example. This isn't difficult, but SO is not hire-a-coder-for-free. – Joris Meys May 30 '11 at 14:20
  • @joris-meys Sorry, my fault! Anyway, I tried it on my own and edited my first post. – digitalaxp May 30 '11 at 21:43
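Following up on the trigram idea with the building blocks from ?strsplit and ?regex, here is a rough sketch (the function name and details are my own, not from the thread):

```r
# Sketch: lowercase, split on runs of non-word characters, drop
# empty strings, then paste consecutive triples into trigrams.
wordTrigrams <- function(text) {
  words <- tolower(unlist(strsplit(text, "\\W+")))
  words <- words[words != ""]
  if (length(words) < 3) return(character(0))
  sapply(seq_len(length(words) - 2),
         function(i) paste(words[i:(i + 2)], collapse = " "))
}

wordTrigrams("Plagiarism is not cool at all")
# [1] "plagiarism is not" "is not cool"       "not cool at"       "cool at all"
```

intersect() on the trigram vectors of two texts then works exactly like the word-level comparison above.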
4

I am sure there are far more robust functions on the natural language processing page, but here's one solution using intersect() to find the common words. The approach is to read in the two sentences, identify the common words, and gsub() them with a combination of the word and a moniker of our choice. Here I chose to use *, but you could easily change that or add something else.

sent1 <- "I shot the sheriff."
sent2 <- "Dick Cheney shot a man."

compareSentences <- function(sentence1, sentence2) {
  sentence1 <- unlist(strsplit(sentence1, " "))
  sentence2 <- unlist(strsplit(sentence2, " "))

  commonWords <- intersect(sentence1, sentence2)

  return(list(
      sentence1 = paste(gsub(commonWords, paste(commonWords, "*", sep = ""), sentence1), collapse = " ")
    , sentence2 = paste(gsub(commonWords, paste(commonWords, "*", sep = ""), sentence2), collapse = " ")
    ))
}

> compareSentences(sent1, sent2)
$sentence1
[1] "I shot* the sheriff."

$sentence2
[1] "Dick Cheney shot* a man."
Chase
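As noted in the other answer, the gsub() call above warns as soon as more than one word is common. One way to sidestep the regex entirely is to compare whole tokens with %in% (a sketch with a helper name of my own choosing, not part of the original answer):

```r
# Sketch: mark every token that occurs in 'common' with a star;
# %in% is vectorised, so no loop and no regex is needed.
markCommon <- function(words, common) {
  hit <- words %in% common
  words[hit] <- paste(words[hit], "*", sep = "")
  paste(words, collapse = " ")
}

markCommon(unlist(strsplit("Dick Cheney shot a man.", " ")),
           c("shot", "a"))
# [1] "Dick Cheney shot* a* man."
```

Punctuation glued to a token ("sheriff.") still blocks a match, as the other answer points out.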