
I am working on classifying a set of documents, and many of them contain large sections of similar (and usually irrelevant) text. I would like to identify and remove those similar sections, as I believe doing so may yield a better model.

An example would be proposals by an organization, each of which contains the same paragraph regarding the organization's mission statement and purpose.

A couple of points make this difficult:

  • the similar sections are not known ahead of time, so a fixed pattern is inappropriate
  • they could be located anywhere in the documents, and the documents do not have a consistent structure
  • the pattern could be many characters long, e.g. 3000+ characters
  • I don't want to remove every similar word, just large sections
  • I don't want to identify which strings are similar; rather, I want to remove the similar sections

I've considered regex and looked through some packages like stringr, stringdist, and the base string functions, but these utilities seem useful only when the pattern is already known and much shorter, or when the documents share a similar structure. In my case the text could be structured differently in each document, and the pattern is not predefined but rather whatever happens to be similar between the documents.
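To illustrate (using the doc_a/doc_b strings from the example further below), the distance-based tools summarize similarity rather than locate it:

    # returns a single LCS-based distance for the whole pair of strings;
    # it quantifies how different the documents are, not where they overlap
    stringdist::stringdist(doc_a, doc_b, method = "lcs")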

I considered building and comparing lists of 3000-grams for each document, but this didn't seem feasible or easy to implement.
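For reference, a rough sketch of that idea in base R might look like the following. It uses word n-grams rather than 3000-character-grams, and ngrams/shared_ngrams are illustrative helpers I made up, not existing functions:

    # build the set of word n-grams for a text (lowercased, split on whitespace)
    ngrams <- function(text, n) {
      tokens <- strsplit(tolower(text), "\\s+")[[1]]
      if (length(tokens) < n) return(character(0))
      vapply(seq_len(length(tokens) - n + 1),
             function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
             character(1))
    }

    # n-grams appearing in both documents, i.e. candidate similar sections
    shared_ngrams <- function(a, b, n) intersect(ngrams(a, n), ngrams(b, n))

This identifies the shared runs, but stitching overlapping n-grams back into maximal sections and deleting them is where it stopped seeming feasible.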

Below is an example of what a complete solution would look like, but really I am not even sure how to approach this problem, so pointers in that direction would be useful as well.

Example code

    doc_a <- "this document discusses african hares in the northern sahara.  african hares
      are the most common land dwelling mammal in the northern sahara.  crocodiles eat
      african hares. this text is from a book written for the foundation for education
      in northern africa."

    doc_b <- "this document discusses the nile. The nile delta is in egypt. the nile is the
      longest river in the world. the nile has lots of crocodiles. crocodiles and
      alligators are different. crocodiles eat african hares. crocodiles are the most common
      land dwelling reptile in egypt. this text is from a book written for the foundation
      for education in northern africa."

    # this function would trim similar sections of 6 or more words in length
    # (length in characters is also acceptable)
    trim_similar(doc_a, doc_b, 6)

Output

    [1] "this document discusses african hares in the northern sahara. african hares
    mammal in the northern sahara. crocodiles eat african hares."  
    [2] "this document discusses the nile. The nile delta is in egypt. the nile is the
    longest river in the world. the nile has lots of crocodiles. crocodiles and alligators
    are different.  crocodiles eat african hares.  crocodiles reptile in egypt."
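For concreteness, a naive version of trim_similar along those n-gram lines might look like the sketch below (it reuses the ngrams helper sketched above, lowercases everything, and collapses whitespace, so it only approximates the output shown):

    # naive sketch: drop every word that falls inside a run of n or more
    # consecutive words shared by both documents
    trim_similar <- function(a, b, n) {
      tok <- function(x) strsplit(tolower(x), "\\s+")[[1]]
      shared <- intersect(ngrams(a, n), ngrams(b, n))
      drop_shared <- function(tokens) {
        keep <- rep(TRUE, length(tokens))
        grams <- ngrams(paste(tokens, collapse = " "), n)
        for (i in seq_along(grams)) {
          # gram i spans tokens i..(i + n - 1); unmark them if it is shared
          if (grams[i] %in% shared) keep[i:(i + n - 1)] <- FALSE
        }
        paste(tokens[keep], collapse = " ")
      }
      c(drop_shared(tok(a)), drop_shared(tok(b)))
    }

    trim_similar(doc_a, doc_b, 6)

On the example above this removes "are the most common land dwelling" and the closing sentence about the foundation from both documents, apart from casing and spacing differences.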
  • Are these similar sections always exactly the same sentences, like in your example? Then it is possible. Otherwise this [SO Post](https://stackoverflow.com/questions/16133184/how-to-detect-that-two-sentences-are-similar) has some pointers. – phiver Sep 07 '19 at 08:06
  • That is also a very interesting question with good answers, but in my case there are two strings which each contain a matching "substring". I want to identify those substrings, which will match exactly. Thanks for the link. – Kern Hast Sep 07 '19 at 13:10
  • Is your definition of a substring in this context a full sentence? Or can it be a subset of a sentence, e.g. 8+ (or n+) words in exactly the same order? – phiver Sep 07 '19 at 13:34
  • n+ words (or characters, either solution works). In the examples, the substring "are the most common land dwelling" was removed from the strings. – Kern Hast Sep 07 '19 at 13:39
  • I had a look this weekend. Removing whole sentences is possible in multiple steps, but removing parts of sentences with n-grams is more work than is warranted for an SO question. – phiver Sep 08 '19 at 17:25
  • Interesting. I wasn't sure if there was a clever way to do it; nothing was jumping out at me. Seems like it might just be a labor-intensive solution. Thanks for taking a look at it. – Kern Hast Sep 08 '19 at 21:39
