I am working on classifying some documents, and a number of them contain large sections of similar (and usually irrelevant) text. I would like to identify and remove those similar sections, as I believe doing so would let me build a better model.
An example would be proposals by an organization, each of which contains the same paragraph regarding the organization's mission statement and purpose.
A few points that make this difficult:
- similar sections are not known ahead of time, making a fixed pattern inappropriate
- the similar sections could be located anywhere in the documents, and the documents do not have a consistent structure
- the pattern could be many characters long, e.g. 3000+ characters
- I don't want to remove every similar word, just large sections
- my goal is not to identify which strings are similar but to remove the similar sections
I've considered regex and looked through some packages like stringr and stringdist, as well as the base functions, but these utilities seem useful mainly if you already know the pattern and the pattern is much shorter, or if the documents have a similar structure. In my case the text could be structured differently, and the pattern is not predefined; it is whatever is similar between the documents.
I considered making and comparing lists of 3000-grams for each document, but this didn't seem feasible or easy to implement.
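To make the n-gram idea concrete, here is a minimal sketch that uses word n-grams (rather than 3000-character ones) on two short made-up sentences. The helper name `word_ngrams` and the sample texts are hypothetical, and this only detects shared runs; it does not remove them or say anything about how well it scales.

```r
# Hypothetical helper: every run of n consecutive words in a text.
word_ngrams <- function(text, n) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Made-up sample texts, not taken from my real documents.
x <- "the quick brown fox jumps over the lazy dog"
y <- "a quick brown fox sleeps near the lazy dog"

# n-grams that occur in both texts
shared <- intersect(word_ngrams(x, 3), word_ngrams(y, 3))
print(shared)
```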
Below is an example of what a complete solution would look like, but really I am not even sure how to approach this problem, so pointers in that direction would be useful as well.
Example code
doc_a <- "this document discusses african hares in the northern sahara. african hares
are the most common land dwelling mammal in the northern sahara. crocodiles eat
african hares. this text is from a book written for the foundation for education
in northern africa."
doc_b <- "this document discusses the nile. The nile delta is in egypt. the nile is the
longest river in the world. the nile has lots of crocodiles. crocodiles and
alligators are different. crocodiles eat african hares. crocodiles are the most common
land dwelling reptile in egypt. this text is from a book written for the foundation
for education in northern africa."
# this function would trim similar sections of 6 or more words in length
# (length in characters is also acceptable)
trim_similar(doc_a, doc_b, 6)
Output
[1] "this document discusses african hares in the northern sahara. african hares
mammal in the northern sahara. crocodiles eat african hares."
[2] "this document discusses the nile. The nile delta is in egypt. the nile is the
longest river in the world. the nile has lots of crocodiles. crocodiles and alligators
are different. crocodiles eat african hares. crocodiles reptile in egypt."
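One possible starting point, sketched under assumptions rather than offered as a finished solution: split each document into words, treat any run of n consecutive words that appears in both documents as "similar", and drop every word covered by such a run. `trim_similar` is the hypothetical function name from the example above. Splitting on whitespace is an assumption, so punctuation stays attached to words and matching is exact and case-sensitive; near-duplicate sections that differ by even one word inside a run would not be caught.

```r
# Sketch: remove runs of n or more consecutive words shared by two documents.
trim_similar <- function(doc_a, doc_b, n = 6) {
  split_words <- function(text) strsplit(trimws(text), "\\s+")[[1]]

  # n-gram i is the run of n words starting at word position i
  ngrams <- function(words) {
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }

  words  <- list(split_words(doc_a), split_words(doc_b))
  grams  <- lapply(words, ngrams)
  shared <- intersect(grams[[1]], grams[[2]])

  # Mark every word covered by a shared n-gram, then keep the rest.
  trim_one <- function(w, g) {
    drop <- rep(FALSE, length(w))
    for (i in which(g %in% shared)) drop[i:(i + n - 1)] <- TRUE
    paste(w[!drop], collapse = " ")
  }

  c(trim_one(words[[1]], grams[[1]]), trim_one(words[[2]], grams[[2]]))
}

# The two example documents from above, written with paste() for line length.
doc_a <- paste(
  "this document discusses african hares in the northern sahara. african hares",
  "are the most common land dwelling mammal in the northern sahara. crocodiles eat",
  "african hares. this text is from a book written for the foundation for education",
  "in northern africa."
)
doc_b <- paste(
  "this document discusses the nile. The nile delta is in egypt. the nile is the",
  "longest river in the world. the nile has lots of crocodiles. crocodiles and",
  "alligators are different. crocodiles eat african hares. crocodiles are the most common",
  "land dwelling reptile in egypt. this text is from a book written for the foundation",
  "for education in northern africa."
)

out <- trim_similar(doc_a, doc_b, 6)
print(out)
```

Because overlapping shared n-grams are merged through the `drop` mask, a shared section longer than n words is removed as a whole. Building every n-gram as a string costs memory roughly proportional to the number of words times n, so for sections thousands of characters long and many documents, something like hashing the n-grams would probably be needed.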