6

I want to compare two texts for similarity, so I need a simple function that lists, clearly and chronologically, the words and phrases occurring in both texts. These words/sentences should be highlighted or underlined for better visualization.

Based on @Joris Meys's idea, I added an array to divide the text into sentences and subordinate clauses.

This is how it looks:

textparts <- function(text){
  separators <- c("\\,", "\\.")
  i <- 1
  while(i <= length(separators)){
    text <- unlist(strsplit(text, separators[i]))
    i <- i + 1
  }
  return(text)
}
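For comparison, the same split can be written as a single strsplit() call with a character class (a sketch, not the original code). Note that the later pieces keep a leading space, which can make an exact comparison of pieces from two texts fail:

```r
# Sketch: single-call version of the split above; the character
# class [,.] matches either separator, so no loop is needed.
textpartsOneCall <- function(text) {
  unlist(strsplit(text, "[,.]"))
}

# The second and third pieces start with a space:
textpartsOneCall("This is a sentence, with a clause. Done.")
# [1] "This is a sentence" " with a clause"     " Done"
```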

textparts1 <- textparts("This is a complete sentence, whereas this is a dependent clause. This thing works.")
textparts2 <- textparts("This could be a sentence, whereas this is a dependent clause. Plagiarism is not cool. This thing works.")

  commonWords <- intersect(textparts1, textparts2)
  commonWords <- paste("\\<(",commonWords,")\\>",sep="")


  for(x in commonWords){
    textparts1 <- gsub(x, "\\1*", textparts1,ignore.case=TRUE)
    textparts2 <- gsub(x, "\\1*", textparts2,ignore.case=TRUE)
  }
  return(list(textparts1,textparts2))

However, sometimes it works, sometimes it doesn't.

I WOULD like to get results like these:

>   return(list(textparts1,textparts2))
[[1]]
[1] "This is a complete sentence"         " whereas this is a dependent clause*" " This thing works*"                  

[[2]]
[1] "This could be a sentence"            " whereas this is a dependent clause*" " Plagiarism is not cool"             " This thing works*"           

whereas I get no results.

digitalaxp
  • Have a look at the CRAN taskview for natural language processing for inspiration: http://cran.r-project.org/web/views/NaturalLanguageProcessing.html – Andrie May 26 '11 at 10:07
  • Thanks, I know about this site. There is a sentence detector, which might be useful for me, but there is no sentence-comparison highlighting of the kind I'm looking for – digitalaxp May 26 '11 at 10:28
  • hehe, now you make it confusing for the people. You can add comments if a solution is not what you're looking for, but you shouldn't change the content of your question. This is a completely different question. Second, you're not coding C, you're coding R. That while() construct is hideous. do `for(i in textparts)` and drop all the rest. And then you should take a look at what goes on. You might have spaces here and there, messing up your result. You might have differences uppercase-lowercase. Check the help files and check your intermediate results, and you'll solve it. – Joris Meys May 31 '11 at 07:51

2 Answers

7

There are some problems with @Chase's answer:

  • differences in capitalization are not taken into account
  • punctuation can mess up the results
  • if more than one word is shared, you get a lot of warnings from the gsub call.
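A minimal illustration of the last point (the strings here are made up): gsub() is not vectorised over its pattern argument, so a vector of common words triggers a warning and only the first word gets replaced:

```r
# With a length-2 pattern, gsub() warns and silently uses only
# the first element ("shot"); "a" is never replaced.
common <- c("shot", "a")
gsub(common, "X", "Dick Cheney shot a man")
# warns, then returns "Dick Cheney X a man"
```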

Based on his idea, here is a solution that makes use of tolower() and some nice regular-expression functionality:

compareSentences <- function(sentence1, sentence2) {
  # split everything on "not a word" and put all to lowercase
  x1 <- tolower(unlist(strsplit(sentence1, "\\W")))
  x2 <- tolower(unlist(strsplit(sentence2, "\\W")))

  commonWords <- intersect(x1, x2)
  #add word beginning and ending and put words between ()
  # to allow for match referencing in gsub
  commonWords <- paste("\\<(",commonWords,")\\>",sep="")


  for(x in commonWords){ 
    # replace the match by the match with star added
    sentence1 <- gsub(x, "\\1*", sentence1,ignore.case=TRUE)
    sentence2 <- gsub(x, "\\1*", sentence2,ignore.case=TRUE)
  }
  return(list(sentence1,sentence2))      
}

This gives the following result:

text1 <- "This is a test. Weather is fine"
text2 <- "This text is a test. This weather is fine. This blabalba This "

compareSentences(text1,text2)
[[1]]
[1] "This* is* a* test*. Weather* is* fine*"

[[2]]
[1] "This* text is* a* test*. This* weather* is* fine*. This* blabalba This* "
Joris Meys
  • Great, thanks. One more question: this is very useful for one or two sentences, but I would like to analyze texts with 10-15 sentences. In other words, it would be better to search for trigrams. – digitalaxp May 30 '11 at 13:46
  • 1
    @digitalaxp : you have all the building blocks. See `?regex` and `?strsplit` for example. This isn't difficult, but SO is not hire-a-coder-for-free. – Joris Meys May 30 '11 at 14:20
  • @joris-meys Sorry, my fault! Anyway, I tried it on my own and edited my first post. – digitalaxp May 30 '11 at 21:43
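Following up on the trigram idea with the building blocks from ?strsplit and ?regex, here is a rough sketch (the function name and details are my own, not from the thread):

```r
# Sketch: lowercase, split on runs of non-word characters, drop
# empty strings, then paste consecutive triples into trigrams.
wordTrigrams <- function(text) {
  words <- tolower(unlist(strsplit(text, "\\W+")))
  words <- words[words != ""]
  if (length(words) < 3) return(character(0))
  sapply(seq_len(length(words) - 2),
         function(i) paste(words[i:(i + 2)], collapse = " "))
}

wordTrigrams("Plagiarism is not cool at all")
# [1] "plagiarism is not" "is not cool"       "not cool at"       "cool at all"
```

intersect() on the trigram vectors of two texts then works exactly like the word-level comparison above.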
4

I am sure there are far more robust functions on the natural language processing page, but here's one solution using intersect() to find the common words. The approach is to read in the two sentences, identify the common words, and gsub() them with a combination of the word and a moniker of our choice. Here I chose to use *, but you could easily change that or add something else.

sent1 <- "I shot the sheriff."
sent2 <- "Dick Cheney shot a man."

compareSentences <- function(sentence1, sentence2) {
  sentence1 <- unlist(strsplit(sentence1, " "))
  sentence2 <- unlist(strsplit(sentence2, " "))

  commonWords <- intersect(sentence1, sentence2)

  return(list(
      sentence1 = paste(gsub(commonWords, paste(commonWords, "*", sep = ""), sentence1), collapse = " ")
    , sentence2 = paste(gsub(commonWords, paste(commonWords, "*", sep = ""), sentence2), collapse = " ")
    ))
}

> compareSentences(sent1, sent2)
$sentence1
[1] "I shot* the sheriff."

$sentence2
[1] "Dick Cheney shot* a man."
Chase
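As noted in the other answer, the gsub() call above warns as soon as more than one word is common. One way to sidestep the regex entirely is to compare whole tokens with %in% (a sketch with a helper name of my own choosing, not part of the original answer):

```r
# Sketch: mark every token that occurs in 'common' with a star;
# %in% is vectorised, so no loop and no regex is needed.
markCommon <- function(words, common) {
  hit <- words %in% common
  words[hit] <- paste(words[hit], "*", sep = "")
  paste(words, collapse = " ")
}

markCommon(unlist(strsplit("Dick Cheney shot a man.", " ")),
           c("shot", "a"))
# [1] "Dick Cheney shot* a* man."
```

Punctuation glued to a token ("sheriff.") still blocks a match, as the other answer points out.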