I have a Word document that contains 100 pages and want to detect duplicate sentences. Is there a way to do this automatically in R?

My steps so far: 1) convert the document to a txt file, 2) read it into R:

     tx=readLines("C:\\Users\\paper-2013.txt")
bartektartanus
sacvf
  • You probably ought to convert it to a text file first, to avoid special characters and formatting. – ilir Apr 17 '14 at 11:25
  • Not completely automated that I know of, but you could: 1) convert doc to txt, 2) import into R with `readLines`, 3) split into sentences by using `strsplit` on the periods, 4) remove extra whitespace with `gsub`, 5) use `duplicated` – BrodieG Apr 17 '14 at 11:25
  • @BrodieG beat me to it :-) I'll wait for your answer. – ilir Apr 17 '14 at 11:35
  • Do upper vs. lower case differences count? You probably need several 'cleanup' operations as BrodieG suggested. – Carl Witthoft Apr 17 '14 at 11:44
  • @sacvf, I'll happily do this if you produce a reproducible example, but really you should try taking the steps I outlined first with your data and see if you can figure it out on your own with help from the R documentation for the functions I mentioned. – BrodieG Apr 17 '14 at 12:21
  • @ilir, feel free to put your version in as an answer if you want to make up the data to use. – BrodieG Apr 17 '14 at 12:22
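
The steps outlined in the comments above can be sketched roughly as follows. This is only a minimal sketch: the sample text is made up and stands in for the output of `readLines`, the case folding with `tolower` is an assumption prompted by the upper/lower case question, and splitting on full stops alone will mis-split abbreviations and decimal numbers.

```r
# Stand-in for readLines() output: one element per line of the text file.
# The sample text is made up for illustration.
tx <- c("My name is R. My name is R. Here is",
        "an example.  my name is r. Something else.")

# 1) Join the lines so sentences that wrap across lines stay intact.
txt <- paste(tx, collapse = " ")

# 2) Split on full stops (fixed = TRUE treats "." literally, not as a regex).
sentences <- strsplit(txt, ".", fixed = TRUE)[[1]]

# 3) Clean up: collapse runs of whitespace, strip the ends, and fold case so
#    that "My name is R" and "my name is r" count as the same sentence.
sentences <- gsub("\\s+", " ", sentences)
sentences <- gsub("^ | $", "", sentences)
sentences <- tolower(sentences)
sentences <- sentences[sentences != ""]

# 4) duplicated() flags every occurrence after the first one.
dups <- unique(sentences[duplicated(sentences)])
print(dups)
```

Whether case differences should count is a judgment call; dropping the `tolower` line keeps the comparison case-sensitive.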

1 Answer

Here is a small code chunk that I have used previously. It is loosely based on Matloff's The Art of R Programming, where he uses something similar as an example:

 sent <- "This is a sentence. Here comes another sentence. This is a sentence. This is a sentence. Sentence after sentence. This is two sentences."

You can split the text into sentences at the full stops using strsplit:

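 out <- strsplit(sent, ".", fixed = TRUE) # fixed = TRUE: "." is a literal full stop, not a regex
 library(gdata)
 out <- trim(out) # trims leading and trailing white space
 # base R alternative (R >= 3.2.0): out <- lapply(out, trimws)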

Now, this may seem clumsy, but bear with me:

 outlist <- list()
 for (i in seq_along(out[[1]])) {
   s <- out[[1]][i]                   # the i-th sentence, used as the list name
   outlist[[s]] <- c(outlist[[s]], i) # append position i to that sentence's entry
 }

Now you have a list in which every entry is named after a sentence and holds the positions at which that sentence occurs. You can use the length of each entry to see how often a sentence is repeated. You can also see whether there are direct duplicates, which helps to distinguish between writing the same sentence twice in a row by mistake (e.g. "My name is R. My name is R.") and coincidentally repeating the same sentence at very different positions in the text, which need not be a problem (e.g. a sentence like "Here is an example." may legitimately occur several times).

 > outlist
 $`This is a sentence`
 [1] 1 3 4
 $`Here comes another sentence`
 [1] 2
 $`Sentence after sentence`
 [1] 5
 $`This is two sentences`
 [1] 6
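
The length check and the direct-duplicate check described above might look something like this. It is a sketch, not part of the original code: the names `counts`, `repeated` and `direct` are made up for illustration, `outlist` is rebuilt by hand from the output shown, and `lengths` assumes R 3.2.0 or later.

```r
# Rebuild the position list from the example above (sentence -> positions).
outlist <- list(
  "This is a sentence"          = c(1, 3, 4),
  "Here comes another sentence" = 2,
  "Sentence after sentence"     = 5,
  "This is two sentences"       = 6
)

# Sentences that occur more than once, with their occurrence counts.
counts <- lengths(outlist)
repeated <- counts[counts > 1]

# A "direct duplicate" is a sentence at consecutive positions, i.e. some
# gap between successive positions equals 1.
direct <- vapply(outlist, function(pos) any(diff(pos) == 1), logical(1))

print(repeated)
print(names(direct)[direct])
```

Here "This is a sentence" shows up in both results, because it occurs three times and positions 3 and 4 are adjacent.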
coffeinjunky