
I'm trying to compare a lot of scripts at once, and most of them have only small differences, such as a different name inside a variable.

For the most part, the scripts should be identical in function, and I'd like to be able to measure how different they actually are.

What I'm thinking of doing is reading the full contents of both files and comparing them character by character, incrementing a counter whenever a difference arises. I'm not sure what I would divide this count by to get a percentage, or whether this is even the best way to go about it.
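
A minimal sketch of the idea described above, assuming the count is divided by the length of the longer input to get a fraction. The function name `positional_difference` is made up for illustration, and the inputs are plain strings rather than files:

```python
from itertools import zip_longest


def positional_difference(a: str, b: str) -> float:
    """Return the fraction of character positions at which a and b differ.

    Assumption: the denominator is the length of the longer string, so
    characters past the end of the shorter string count as differences.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # two empty inputs are treated as identical
    # zip_longest pads the shorter string with None, which never equals a character.
    mismatches = sum(1 for x, y in zip_longest(a, b) if x != y)
    return mismatches / longest


print(positional_difference("abc", "abd"))
print(positional_difference("abc", "xabc"))
```

The second call shows the weakness of a purely positional comparison: a single inserted character shifts everything after it, so every position is reported as different even though the inputs are nearly the same.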

If you have an idea or advice to give me I would greatly appreciate it!

Zach
  • 1
    See http://stackoverflow.com/questions/8566396/is-there-any-working-real-open-source-plagiarism-checker-available ? – dg99 Jul 11 '14 at 16:14
  • 2
    May I ask what your end goal is? There are a number of diff tools available, my favorite of which is [Beyond Compare](http://www.scootersoftware.com/moreinfo.php?zz=screenshot&shot=TextCompare). – Cory Kramer Jul 11 '14 at 16:15
  • 2
    If you wanted to learn a useful algorithm, you should look up "edit distance". It can be found in a chapter of this book: http://www.cs.berkeley.edu/~vazirani/algorithms/chap6.pdf Though, I don't imagine edit distance to be the greatest measure of script differences in general. – Zhouster Jul 11 '14 at 16:16
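
For reference, the "edit distance" mentioned in the last comment can be sketched with the classic dynamic-programming (Levenshtein) formulation. This is only an illustration, not taken from the linked chapter; the function name `edit_distance` is hypothetical, and the quadratic running time makes it slow for large files:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,  # delete char_a
                    current[j - 1] + 1,  # insert char_b
                    previous[j - 1] + substitution_cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))
```

Dividing the result by the length of the longer input is one way to turn the distance into a fraction between 0 and 1.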

1 Answer


Two suggestions:

1) Check out this SO question and Python's difflib module, which is part of the standard library. This SO question specifically asks about difflib.

Also, a guy named Doug Hellmann has an excellent series of blog posts called Python Module of the Week (PyMOTW). Here is his post about difflib.
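
As a rough illustration of how difflib could produce the percentage the question asks for, here is a minimal sketch using `difflib.SequenceMatcher`. The helper name `similarity_percent` and the choice to compare line by line rather than character by character are assumptions, not something from the linked posts:

```python
import difflib


def similarity_percent(text_a: str, text_b: str) -> float:
    """Return a 0-100 similarity score based on matching lines.

    SequenceMatcher.ratio() is 2*M/T, where M is the number of matching
    elements and T is the total number of elements in both sequences.
    """
    lines_a = text_a.splitlines()
    lines_b = text_b.splitlines()
    matcher = difflib.SequenceMatcher(None, lines_a, lines_b)
    return 100.0 * matcher.ratio()


script_a = "x = 1\nprint(x)\n"
script_b = "y = 1\nprint(x)\n"
print(similarity_percent(script_a, script_b))  # one of two lines matches: 50.0
```

Passing the two strings directly to `SequenceMatcher` instead of lists of lines gives a character-level score, which is more forgiving of small renames but slower on large files.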

2) If those don't work for you, try searching for language-independent file comparison algorithms first, and think about which ones would be easiest to implement in Python. A simple Google search for "file comparison algorithms" turned up several decent-looking possibilities:

Here is a published PDF with a diff algorithm

This site has a discussion of several different algorithms with links

skrrgwasme