14

I've searched the Internet for a while now and I have not been able to find any free (or cheap) tools/utilities/modules that can analyze a set of Perl files (modules or scripts) and flag duplicate or cloned or copy/pasted code.

I'm better now, but I used to copy and paste sections of code all over the place. I'd like to clean up my old code duplication, but some tool help would be appreciated so I don't have to go through all my old code with a fine-tooth comb. Besides, spotting this sort of offense by hand is error-prone.

Kurt W. Leucht
  • You might find this Perl Monks node interesting: http://www.perlmonks.org/index.pl?node_id=667084 – daotoad Sep 22 '09 at 21:30
  • Better would be to not copy-paste code in the first place. It would be worthwhile to go over your old code anyway and (re)familiarize yourself with it; unless you've got millions of lines of code, you should have a general concept of it in your head anyway, and be aware of potential candidates to rewrite/refactor. – Ether Sep 22 '09 at 22:04
  • Related question - http://stackoverflow.com/questions/2490884/why-is-copy-and-paste-of-code-dangerous – Oded Jul 25 '10 at 17:45

6 Answers

5

Funny, a similar question was posted to SO just a few minutes ago.

Here is a link with some tools you may find useful.

Code Comparison and Plagiarism Detection

RC.
  • Could you please link to that other question? – innaM Sep 22 '09 at 18:56
  • http://stackoverflow.com/questions/1461805/how-can-i-compare-similar-codebases -- a similar question about C++ – mob Sep 22 '09 at 19:01
  • I'm evaluating the CodeMatch product. Had to get on a corporate newsletter email list to download the software, though. Luckily, I used a disposable email address. – Kurt W. Leucht Sep 22 '09 at 19:17
  • mobrule posted the SO link I was referring to. Thanks. – RC. Sep 22 '09 at 19:21
  • Not impressed with CodeMatch. It compares two sets of files for similarity to each other, but does not appear to search and find duplicate code within a single file. I'm uninstalling it as I type this. – Kurt W. Leucht Sep 22 '09 at 19:31
4

What do you mean by duplicate code? Just character-exact matches, or semantic matches?

There are several tools, like http://pmd.sourceforge.net/, that can detect duplicate code by string matching. That tool is aimed at Java, but the source matching works on plain text.

If you want semantic matching, like

sub A
{return 1;}

to match

sub B
{
    return 1;
}

then you'll need something else. :(
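To make the distinction concrete, here is a minimal sketch in Python (not Perl, and not how PMD or any tool named in this thread works). It assumes a crude, regex-based normalization: the subroutine name is masked and all whitespace is dropped before comparing. The function name `normalize_perl` and the `NAME` placeholder are made up for illustration. Because it is not a real parser, it will also mangle whitespace inside string literals, so treat it as a way to see the idea rather than a usable detector.

```python
import re


def normalize_perl(snippet: str) -> str:
    """Crudely normalize a Perl snippet (illustrative only, not a parser).

    Masks the name after `sub` and removes every whitespace character,
    so layout and subroutine naming no longer affect the comparison.
    """
    masked = re.sub(r"\bsub\s+\w+", "sub NAME", snippet)
    return re.sub(r"\s+", "", masked)


# The two subs from the answer above: same body, different name and layout.
sub_a = "sub A\n{return 1;}\n"
sub_b = "sub B\n{\n    return 1;\n}\n"

# An exact string comparison sees two different snippets ...
print(sub_a == sub_b)
# ... while the normalized forms compare equal.
print(normalize_perl(sub_a) == normalize_perl(sub_b))
```

Anything beyond this (renamed variables, reordered statements) needs a tool that actually understands the language's structure.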

chollida
  • Thanks. I just tried the PMD plugin for Eclipse and it does not appear to be able to scan Perl (or plain text) files. The choices are Java, JSP, CPP, C, PHP, Ruby, and Fortran. For giggles, I tried a couple and it gave me an empty copy/paste report. – Kurt W. Leucht Sep 22 '09 at 19:01
  • By default it looks for blocks that are about 30 lines long. We use it for our in-house language, which is loosely based on JavaScript, and it works fine for us. – chollida Sep 22 '09 at 19:05
  • You can run all the code through perltidy to smooth out the stylistic differences (but not the subroutine names). – Schwern Sep 22 '09 at 23:23
  • SD's CloneDR can do this particular "semantic" match just fine, because you only need to do a syntactic match, with substitutions, over the language structure. CloneDR uses a parser for the actual language to eliminate differences in whitespace and comments, and to detect when a construct can be parameterized to produce such a match. See www.semanticdesigns.com/Products/Clone – Ira Baxter Jan 28 '10 at 23:13
2

I have used CCFinder in the past to find sections of code which are duplicates. It works quite well but has an... interesting interface. It doesn't have native support for Perl, but it does have a plaintext option, which should at least work for detecting copy-and-paste. It is available for Windows and Ubuntu. It is freeware, not open source, unfortunately.

jamuraa
  • Oh wow ... this is a great utility! And the way it visually shows you your duplicate code on a scatter plot is amazing! I think this is the coolest piece of free software that I've ever experienced. The user interface is a bit clunky at first, but once you get used to it, it is a wonderfully powerful code duplication analyzer. Two nits, though. It's not cross-platform. And it leaves a bunch of temporary files behind in your source code tree. – Kurt W. Leucht Oct 22 '09 at 20:10
  • I was able to easily modify one of the Python files to recognize and ignore POD and Perl comments. Now I'm even more excited about CCFinder! (Had to remove all the temporary files by hand and restart to make it work, though.) – Kurt W. Leucht Oct 23 '09 at 15:14
  • The source code is available under MIT license: http://www.ccfinder.net/ccfinderxos.html – hexcoder May 13 '13 at 10:32
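For a feel of what plain-text copy/paste detection involves, including the idea of ignoring POD and Perl comments mentioned in the comments above, here is a rough Python sketch. It is not CCFinder's algorithm and not the modification described above. It assumes a simple approach: skip blank lines, full-line `#` comments, and POD blocks (from a line starting with `=word` through `=cut`), then report every run of `window` consecutive remaining lines that occurs more than once. The names `significant_lines`, `find_duplicates`, and `window`, and the two sample files, are hypothetical.

```python
import re
from collections import defaultdict


def significant_lines(source: str):
    """Yield (line_number, text) for lines that are not blank, POD, or full-line comments."""
    in_pod = False
    for number, line in enumerate(source.splitlines(), start=1):
        if re.match(r"=cut\b", line):       # end of a POD block
            in_pod = False
            continue
        if re.match(r"=[A-Za-z]", line):    # start of a POD block, e.g. =pod or =head1
            in_pod = True
        if in_pod:
            continue
        text = line.strip()
        if text and not text.startswith("#"):
            yield number, text


def find_duplicates(files: dict, window: int = 3) -> dict:
    """Map each repeated run of `window` significant lines to where it starts."""
    seen = defaultdict(list)
    for name, source in files.items():
        lines = list(significant_lines(source))
        for i in range(len(lines) - window + 1):
            chunk = tuple(text for _, text in lines[i:i + window])
            seen[chunk].append((name, lines[i][0]))
    return {chunk: places for chunk, places in seen.items() if len(places) > 1}


# Two made-up files sharing three lines of code, separated by a comment and POD.
example = {
    "a.pl": "my $x = 1;\n# comment\nmy $y = 2;\nprint $x + $y;\n",
    "b.pl": "=pod\n\nDocs\n\n=cut\n\nmy $x = 1;\nmy $y = 2;\nprint $x + $y;\n",
}

for chunk, places in find_duplicates(example).items():
    print(places, chunk)
```

A small `window` produces a lot of noise on real code; an earlier comment in this thread mentions PMD defaulting to blocks of about 30 lines. Trailing comments and heredocs are not handled here.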
0

Semantic Designs makes a product called Clone Dr. that appears to be able to analyze a large number of language types for cloned sections of code. But it appears that their free evaluation version only works on Java and Cobol.

Kurt W. Leucht
  • I'm the CloneDR product manager. It provides (we think) really good results by virtue of comparing ASTs for programs, which gets rid of any formatting issues completely. It does handle a lot of languages, but Perl isn't presently one of them. After all, "only Perl can parse Perl" :-} [Actually, we have very good parsing engines; we'll get to Perl someday.] – Ira Baxter Sep 27 '09 at 09:25
  • Good to know. There may not be a ton of customers out there for Perl, though. I tried your evaluation version of Clone Dr. on an old Java project of mine a while back and I was impressed with the results. It was this experience that made me realize that I needed to analyze all the rest of my code (some of which includes some large Perl scripts) for copy/paste offenses. – Kurt W. Leucht Sep 28 '09 at 16:06
  • You can get evaluation versions for Java, C#, C, C++, COBOL and PHP. You might have to ask at the web site. – Ira Baxter Jan 28 '10 at 23:14
  • @Ira: so you intend to run (parts of) the code during the analysis? – SamB Mar 13 '12 at 22:29
  • @KurtW.Leucht: We're making progress on Perl. We've done a trial run of Perl parsing and used the CloneDR on it, on our internal scripts. Seems to basically work. We have some polishing to do of the Perl parser, and nobody pushing hard for product, so it is likely a while yet. – Ira Baxter Mar 13 '12 at 23:11
  • @SamB: Where did you get that idea? CloneDR is a static analysis tool; it reads and parses the source code into compiler data structures called "abstract syntax trees", and compares the code by comparing those ASTs. – Ira Baxter Mar 13 '12 at 23:12
  • @IraBaxter: Well, see http://search.cpan.org/~adamk/PPI-1.215/lib/PPI.pm#Background – SamB Mar 13 '12 at 23:43
  • @SamB: I've heard this before. What this says is that some parts of Perl don't have a unique context-free parse, which means a parser trying to produce a single parse will fail. OK, so get a parser that can tolerate ambiguity and multiple parses (we have one). It also says that if you try hard enough, you can write a Perl program that has ambiguous parses in many places, maybe even ambiguously ambiguous ones. As a practical matter, most Perl programs aren't like this. So a parser that can parse most of a Perl program reasonably, and manage the parts that are ambiguous, can do just fine. We do this. – Ira Baxter Mar 14 '12 at 00:16
0

I just evaluated Simian. It has a 15-day free evaluation period and costs a hundred bucks for a single-user license. It doesn't officially support Perl, but it treats Perl files as plain text and analyzes them anyway. This is a super fast utility, and super easy to use! The report it generated was simple and easy to interpret. I totally approve of this tool. Now I just need to talk to my boss and get him to purchase a license.

Kurt W. Leucht
  • P.S. I emailed the Simian developers and asked them if they intended to support Perl, and they immediately wrote back that supporting Perl had never occurred to them, but that they would put it on their to-do list. I'm not even a paying customer. Now that's great support. (unless they were just blowing me off) – Kurt W. Leucht Oct 11 '09 at 03:27
  • Did they? Inquiring minds want to know. I think Simian requires lexical analysis, and Perl is a bitch to lex, let alone parse. – Ira Baxter Mar 13 '12 at 23:13
0

Here's another web page listing some clone detection tools:

http://sel.ics.es.osaka-u.ac.jp/cdtools/index-e.html

Kurt W. Leucht