0

I'm interested in comparing code to find matches, i.e. to see if two different pieces of code are equivalent. For example, here are 4 matches for a method that returns the sum of two numbers (in Java).

int sum(int a, int b){
  return a + b;
}

int sum(int a, int b){
  return b + a;
}

int sum(int a, int b){
  int sum = a + b;
  return sum; 
}

int sum(int a, int b){
  int total = a + b;
  return total; 
}

While it's easy to do a text comparison of two pieces of source code, its difficult to write code that will recognize the above matches. This seems to be a job for a Parser or Compiler, but it doesn't need to be 'perfect', since it's just looking for matches.

This is for a Rails website, so ideally it should be able to work within Ruby, but I can also run a separate service. Treetop is a language for describing grammars, but describing the grammars is difficult too. Is there an existing tool to compare source code for multiple languages (such as Java, C++, Ruby and Python)?

It just needs to find matches between source code in one language at a time, though it would be cool if it could find matches between source codes in different languages too.

Update: A match isn't any code that produces the same result, it's code that uses the same process and steps to get the same result. The tool doesn't need to find every possible match, but it should be able to recognize code that is identical except for small differences, like variable names or order (as in above examples).

Ari
  • 1,974
  • 19
  • 30
  • You could write unit tests and run them against the different implementations. If you want to test Java methods from Ruby, you will need to use JRuby. This may be worth a read: https://github.com/jruby/jruby/wiki/CallingJavaFromJRuby – Patrick Oscity Nov 14 '13 at 14:15
  • @plly, I want it to be able to find matches without running the code. Also a match isn't any code with the same result, but code with the same approach (without the text exactly matching). – Ari Nov 14 '13 at 14:18
  • 1
    Possible duplicate: http://stackoverflow.com/q/3450907/120163 – Ira Baxter Nov 15 '13 at 10:40

3 Answers3

3

This problem is known as the Function Problem: figuring out whether two programs compute the same function. It is known to be undecidable, i.e. such a tool cannot possibly exist.

Basically, if you had such a tool, then you could ask: is some program P equivalent to this program:

while (true);

and you would have solved the Halting Problem. (That's not actually how the proof goes, it's much more complicated than that, but that's the basic idea.)

Jörg W Mittag
  • 363,080
  • 75
  • 446
  • 653
  • I don't need it to find every match, just to find more similar matches than simple text comparison. For example, comparing Java byte code would find many more matches than comparing Java source code. However, I want something that can compare multiple languages, even roughly. – Ari Nov 14 '13 at 16:30
1

For ruby take a look at https://github.com/seattlerb/flay For C# Resharper can be told to look for specific code structures ignoring names. Not quite what you are looking for, but powerful.

I know of nothing that lets you compare between languages.... except possibly if you uses Reflector you could decompile .net bytecode back to C# then use resharper, thereby converting between .net languages.

Nigel Thorne
  • 21,158
  • 3
  • 35
  • 51
  • Will look at Flay. I wonder of there's a way to modify it to work with other languages... – Ari Nov 15 '13 at 05:54
1

Look at PMD CPD, which supports many languages, and also has some good ideas regarding what to ignore during comparison, etc.

Also look at minification. You could probably improve upon that, since you don't need the result to still work as code, like minificators do. But you likely won't find many minificators for compiled languages. There's also one potential pitfall I see here - the minified versions of, for example, two functions with only the parameters shuffled might turn out less similar when minified, depending on how the minifier renames parameters (they usually just name them in order, e.g. a,b,c,...).

Nigel mentioned compiling .NET languages to bytecode, and decompiling back - the same is also possible for JVM bytecode, and maybe even for binaries (or things like LLVM IR), but most of this is too low-level for what you're trying to do, only covers a few languages per approach, and is also likely very hard or impossible for some of the approaches.

If you wanted to make a very hacky approximation of a common language, you could try picking a few common things such as function headers, loops, braces/indentation, and try to make languages more similar with very simple parsers (using just text-replace and regexes, for instance). e.g. you could replace Java's public int func(int a, char b) with def func(a,b), and for Ruby, Scala and Python you'd need to do almost nothing in this case. This is a terrible idea, but some of these transformations are easy to write, so if all else fails, try and see if it gets you anywhere. And if you do, remember to write unit tests - simple parsers for complex languages break very easily.

edit: One thing to also look at might be programming homework plagiarism detectors, e.g. http://theory.stanford.edu/~aiken/moss/

HairyFotr
  • 1,453
  • 1
  • 15
  • 28