2

I want to build an application that determines whether text was copied between two documents by reading the text from both and comparing it. Has anyone tried this, and what is the best way to approach it? If machine learning and natural language processing are involved, to what extent?

Goodman

2 Answers

1

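There are techniques that rely purely on set-theoretic concepts: each document is reduced to a set of overlapping word sequences (shingles), and the two sets are compared.

See http://en.wikipedia.org/wiki/W-shingling for a good starting point.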
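To make the idea concrete, here is a minimal sketch of w-shingling with a Jaccard-style resemblance score. The function names `shingles` and `resemblance`, the regex tokenizer, and the default `w=4` are my own assumptions for illustration, not part of any particular library.

```python
import re


def shingles(text, w=4):
    """Return the set of w-word shingles (as tuples) found in text."""
    # Assumption: lowercase and split on word characters; a real tool
    # would likely need more careful normalization.
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < w:
        # Text shorter than one shingle: treat the whole text as one shingle.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a, b, w=4):
    """Jaccard coefficient of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 0.0  # Convention chosen here for two empty documents.
    return len(sa & sb) / len(sa | sb)
```

A score near 1.0 suggests heavy overlap, and a score near 0.0 suggests little shared wording. Any threshold for calling something "copied" would have to be tuned on your own documents.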

Viktor Latypov

0

I believe Copyscape uses 4-grams to help determine uniqueness.

Sequences of N consecutive items (words or characters) taken from a text are referred to as N-grams, so a 4-gram is a run of four.

Alternatively, another SO answer linked to a language-independent algorithm that compares bigrams on a character basis. It's already implemented in Java, which would save time.
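For a rough idea of what a character-bigram comparison can look like, here is a minimal sketch using the Sørensen–Dice coefficient. Whether the linked algorithm works exactly this way is an assumption on my part, and the names `char_bigrams` and `dice_similarity` are made up for this example.

```python
from collections import Counter


def char_bigrams(text):
    """Count the character bigrams inside each whitespace-separated word."""
    counts = Counter()
    for word in text.lower().split():
        counts.update(word[i:i + 2] for i in range(len(word) - 1))
    return counts


def dice_similarity(a, b):
    """Sørensen–Dice coefficient over character bigrams (0.0 to 1.0)."""
    ca, cb = char_bigrams(a), char_bigrams(b)
    total = sum(ca.values()) + sum(cb.values())
    if total == 0:
        return 0.0  # Neither string contains a bigram.
    # Counter intersection keeps the minimum count of each shared bigram.
    overlap = sum((ca & cb).values())
    return 2.0 * overlap / total
```

For example, `dice_similarity("night", "nacht")` shares only the bigram `ht` out of eight in total, giving 0.25. Because it works on characters rather than words, this kind of measure does not depend on a language-specific tokenizer.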

Community
HappyTimeGopher