Creating document comparison software

Question

I want to create an application that can determine whether text was copied between two documents by reading the text of both and comparing them. Has anyone tried this, and what is the best way to approach it? If machine learning and natural language processing are involved, to what extent?

lots of people have tried this. turnitin.com is just one example. — emory, May 12 '12 at 19:47
wanted to check plagiarism by comparing the texts in the two documents — Goodman, May 12 '12 at 19:58

score 1 · Answer 1 · answered May 12 '12 at 20:06

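Some techniques rely purely on set-theoretic concepts: each document is reduced to a set of word sequences, and the overlap between the two sets is measured.

W-shingling is a good place to start: http://en.wikipedia.org/wiki/W-shingling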
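
As a rough illustration of the shingling idea, here is a minimal Python sketch. The function names `shingles` and `resemblance`, the default shingle width of 4, and the regex-based tokenizer are my own assumptions for the example, not part of any particular library. It reduces each document to its set of w-word shingles and scores the pair with the Jaccard coefficient.

```python
import re

def shingles(text, w=4):
    """Return the set of w-word shingles (contiguous word tuples) in text.

    Tokenization is an assumption: lowercase, split on non-word characters.
    """
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(a, b, w=4):
    """Jaccard similarity of the two shingle sets: |A & B| / |A | B|."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 0.0  # neither text is long enough to produce a shingle
    return len(sa & sb) / len(sa | sb)
```

A score near 1.0 suggests heavy overlap and a score near 0.0 suggests little shared wording. Where to set the cutoff for "copied" is something you would have to tune on your own documents.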

Viktor Latypov

score 0 · Accepted Answer · edited May 23 '17 at 11:53

I believe Copyscape uses 4-grams to help determine uniqueness.

An N-gram is a contiguous sequence of N items (words or characters) taken from the text, so a 4-gram is a run of four.

However, another SO answer links to a language-independent algorithm that compares character bigrams. It is already implemented in Java, which would save time.
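
The linked answer is not reproduced here, so the following is only a guess at the kind of approach it describes. One common way to compare character bigrams is the Sørensen-Dice coefficient: twice the number of shared bigrams divided by the total number of bigrams in both strings. The names `bigrams` and `dice_similarity`, and the choice to split on whitespace first, are assumptions made for this sketch.

```python
from collections import Counter

def bigrams(text):
    """Count adjacent character pairs within each whitespace-separated word."""
    counts = Counter()
    for word in text.lower().split():
        counts.update(word[i:i + 2] for i in range(len(word) - 1))
    return counts

def dice_similarity(a, b):
    """Sorensen-Dice coefficient over character bigrams, in [0.0, 1.0]."""
    ba, bb = bigrams(a), bigrams(b)
    total = sum(ba.values()) + sum(bb.values())
    if total == 0:
        return 0.0  # no bigrams at all (empty or single-character words)
    # Counter intersection keeps the minimum count of each shared bigram.
    overlap = sum((ba & bb).values())
    return 2 * overlap / total
```

Because it works on characters rather than words, this kind of measure needs no language-specific tokenizer or dictionary, which is presumably what "language-independent" means here.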

answered May 12 '12 at 21:28

HappyTimeGopher
