28

I work as student support staff in a biology research institute, and Perl seems to be used everywhere. It isn't used for every single project, but more than half the people here seem to have a few Perl books in their office or on their desk.

Why is Perl used so much in biology?

Peter Mortensen
Kevin
  • 2
    Presumably because it's a capable interpreted language, and it's been around longer than python? Same way tons of scientific code is written in fortran - it was just *the* compiled language back then. – Cascabel Mar 26 '10 at 22:31
  • 2
    Have you asked the people you work with? – Peter Alexander Mar 26 '10 at 22:34
  • @Poita_: Good call. Of course, they'll probably say something like what Paul and I did - it's just what we use, stuff's written in it... – Cascabel Mar 26 '10 at 22:35
  • This is true -- I should ask the people I interact with. I'll do this. Today, I was talking to one researcher about her assembler application and she told me they were just about finished rewriting it in Java. The first version was written in Perl and she said it was a mess. – Kevin Mar 26 '10 at 22:54
  • 4
    The same reason that other sciences use FORTRAN and game developers use C++: existing libraries. – jrockway Mar 26 '10 at 23:16
  • 2
    @Kevin: she should have used perltidy and perlcritic. – Alexandr Ciornii Mar 27 '10 at 07:53

12 Answers

47

Lincoln Stein highlighted some of the saving graces of Perl for bioinformatics in his article: How Perl Saved the Human Genome Project.

From his analysis:

I think several factors are responsible:

  1. Perl is remarkably good for slicing, dicing, twisting, wringing, smoothing, summarizing and otherwise mangling text. Although the biological sciences do involve a good deal of numeric analysis now, most of the primary data is still text: clone names, annotations, comments, bibliographic references. Even DNA sequences are textlike. Interconverting incompatible data formats is a matter of text mangling combined with some creative guesswork. Perl's powerful regular expression matching and string manipulation operators simplify this job in a way that isn't equalled by any other modern language.

  2. Perl is forgiving. Biological data is often incomplete, fields can be missing, or a field that is expected to be present once occurs several times (because, for example, an experiment was run in duplicate), or the data was entered by hand and doesn't quite fit the expected format. Perl doesn't particularly mind if a value is empty or contains odd characters. Regular expressions can be written to pick up and correct a variety of common errors in data entry. Of course this flexibility can also be a curse. I talk more about the problems with Perl below.

  3. Perl is component-oriented. Perl encourages people to write their software in small modules, either using Perl library modules or with the classic Unix tool-oriented approach. External programs can easily be incorporated into a Perl script using a pipe, system call or socket. The dynamic loader introduced with Perl5 allows people to extend the Perl language with C routines or to make entire compiled libraries available for the Perl interpreter. An effort is currently under way to gather all the world's collected wisdom about biological data into a set of modules called "bioPerl" (discussed at length in an article to be published later in the Perl Journal).

  4. Perl is easy to write and fast to develop in. The interpreter doesn't require you to declare all your function prototypes and data types in advance, new variables spring into existence as needed, calls to undefined functions only cause an error when the function is needed. The debugger works well with Emacs and allows a comfortable interactive style of development.

  5. Perl is a good prototyping language. Because Perl is quick and dirty, it often makes sense to prototype new algorithms in Perl before moving them to a fast compiled language. Sometimes it turns out that Perl is fast enough so that the algorithm doesn't have to be ported; more frequently one can write a small core of the algorithm in C, compile it as a dynamically loaded module or external executable, and leave the rest of the application in Perl (for an example of a complex genome mapping application implemented in this way, see http://waldo.wi.mit.edu/ftp/distribution/software/rhmapper/).

  6. Perl is a good language for Web CGI scripting, and is growing in importance as more labs turn to the Web for publishing their data.
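To make points 1 and 2 concrete, here is a minimal sketch of that kind of forgiving text cleanup. It is written in Python rather than Perl, and the record layout (name, sequence, optional comment, separated by tabs, commas, or semicolons) and the function name `clean_record` are assumptions made up for illustration, not anything from the article.

```python
import re

# Hypothetical record layout for illustration:
#   clone_name <sep> sequence <sep> optional comment
# where <sep> is a tab, comma, or semicolon, possibly padded with spaces.
FIELD_SEPARATOR = re.compile(r"[ ]*[\t,;][ ]*")
NON_BASE = re.compile(r"[^ACGTN]")


def clean_record(line):
    """Return (name, sequence, comment), tolerating missing fields and odd characters."""
    fields = FIELD_SEPARATOR.split(line.strip())
    # Pad so a missing sequence or comment becomes an empty string instead of an error.
    fields += [""] * (3 - len(fields))
    name, raw_sequence, comment = fields[0], fields[1], fields[2]
    # Upper-case the sequence and drop anything that is not a base letter
    # (digits, spaces, and dashes from hand-entered data).
    sequence = NON_BASE.sub("", raw_sequence.upper())
    return name, sequence, comment
```

For example, `clean_record("cloneA ,  acgt-12 ac ; run in duplicate")` gives `("cloneA", "ACGTAC", "run in duplicate")`, and a line holding only a name still yields a three-field record with empty strings.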

mob
  • 1
    @mobrule : Regarding point #6, is your analysis from 2010 or did you get that from an old book ??? – Philippe Mar 27 '10 at 13:15
  • 1
    The analysis is excerpted from the linked article, which was from the summer of 1996. – mob Mar 27 '10 at 20:27
  • Sorry I didn't read the source... I was wondering who was still doing cgi scripts for big websites today ;-) – Philippe Mar 27 '10 at 20:49
  • @Philippe: This is pretty old. Bioperl came out ages ago and is widely used. I use python myself for many of the same reasons. – Chinmay Kanchi Apr 07 '10 at 08:40
16

The real answer probably has less to do with Perl than you think. Many of the things that happen are accidents of history. At the time, way back when, Perl was pretty popular, Java was getting more popular, not too many people were paying attention to Python, and Ruby was just getting started.

The people who needed to get work done used Perl and made some libraries in Perl, and other people started using those libraries. Once people start using something that is moderately useful to them, they tend not to switch (economists call those "switching costs"). From there, even more people start using it because a lot of other people are using it.

The same evolution might not happen today. I'd say that Perl, Python, and Ruby are all completely adequate and up to the task. All the things that mobrule quotes from Lincoln Stein could apply to any of the three today. If everyone had to start from scratch today, any one of those languages could be the one that everyone uses.

I've noticed, from my own client base though (a very small and unrepresentative sample of biotech), that the people pushing the programming for a lot of the biological stuff seemed to be at least part-time sysadmins who were supporting scientists. The scientists worried about the science and did some light programming, but the IT support people were doing a lot of the heavy lifting for the non-science parts. Perl is very well positioned as a sysadmin tool since it's the duct-tape of the internet.

Community
brian d foy
  • 4
    I tend to disagree here. Perl really is rather expressive, so if your primary concern is not programming, but getting the job done, and getting back to your real job, then the expressiveness of the language helps the computer to think like you, while a more typical language more helps you to think like the computer. – singingfish Mar 27 '10 at 09:56
  • 4
    While Ruby and Python are very similar to Perl, their regular expression engines are not as good. They aren't as fast and can't do as many "crazy" things. This normally isn't a problem, because if you're doing something really crazy a grammar is probably a better fit anyway, but then you have to teach all those biologists grammars, recursive descent parsing, etc. – mpeters Mar 27 '10 at 18:25
  • 1
    Although Perl might have more power, I find that most people barely use everything in Learning Perl. – brian d foy Mar 27 '10 at 21:05
  • @briandfoy: That's true for almost any language. It's more true for C++ than Perl – slebetman Jan 16 '16 at 16:00
  • 1
    I agree with @singingfish – hepcat72 Sep 06 '16 at 18:45
12

Probably because Perl is good at manipulating strings, and much research in genetics involves the manipulation of veeery long "ACTGCATG..." strings. Just guessing...
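As a rough illustration of what "manipulating" such strings tends to mean, here is a small Python sketch (not Perl) of three common operations. The function names are my own, and it assumes plain A/C/G/T sequences with no ambiguity codes.

```python
import re

# Translation table mapping each base to its complement (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Complement every base, then reverse the string."""
    return sequence.translate(COMPLEMENT)[::-1]


def gc_content(sequence):
    """Fraction of bases that are G or C; 0.0 for an empty sequence."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


def find_motif(sequence, motif):
    """0-based start positions of every occurrence of motif, overlaps included."""
    # The lookahead matches without consuming text, so overlapping hits are found.
    pattern = f"(?={re.escape(motif)})"
    return [match.start() for match in re.finditer(pattern, sequence)]
```

With the string from the answer, `reverse_complement("ACTGCATG")` returns `"CATGCAGT"` and `gc_content("ACTGCATG")` returns `0.5`.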

Federico A. Ramponi
  • What makes Perl very good at manipulating strings? – Kevin Mar 26 '10 at 22:52
  • 10
    They've got a really good regular expression engine, and always have had. (Larry Wall is one of the RE engine gods, a shub-niggurath of string manipulation.) – Donal Fellows Mar 26 '10 at 23:00
  • 1
    @Kevin: That was Larry Wall (Perl's creator)'s original intent -- to be a Pathologically Eclectic Rubbish Lister. :) – Ether Mar 26 '10 at 23:02
  • @Donal: heh, we both mentioned Larry at the same time :) – Ether Mar 26 '10 at 23:02
  • @Ether: You should look up "shub-niggurath" to see what sort of god I had in mind. Make sure the room is well-lit first. ;-) – Donal Fellows Apr 07 '10 at 08:39
9

I use lots of Perl for dealing with qualitative and quantitative data in social science research. For getting things done quickly (largely with text) and for finding libraries in one central place (CPAN), it can't be surpassed.

Perl is also excellent glue, so if you have some instrumental records, and you need to glue them to data analysis routines, then Perl is your language.
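A minimal sketch of what "glue" means here, in Python rather than Perl: hand text to an external program on its standard input and collect what it prints. The helper name `run_tool` is hypothetical, and the "analysis routine" is only a stand-in (a child Python process that upper-cases its input), since the answer doesn't name a real tool.

```python
import subprocess
import sys


def run_tool(command, input_text):
    """Feed text to an external program on stdin and return what it writes to stdout."""
    completed = subprocess.run(
        command,
        input=input_text,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the tool exits with a non-zero status
    )
    return completed.stdout


# Stand-in for a real analysis program: a child Python process that upper-cases its input.
UPPERCASE_TOOL = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]
```

Calling `run_tool(UPPERCASE_TOOL, "acgt\n")` returns `"ACGT\n"`; in practice the command list would name whatever instrument or analysis program you are wiring together.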

brian d foy
singingfish
8

Perl seems to be the language of choice for bioinformatics - there's even an O'Reilly title on just this subject: Beginning Perl for Bioinformatics.

Peter Mortensen
Paul R
  • 3
    Exactly! But why?! :) Maybe I'll see if I can find a copy of that book, since it might have an introductory chapter explaining the answer to my question. – Kevin Mar 26 '10 at 22:51
5

Perl is very powerful when it comes to dealing with text, and it's present in almost every Linux/Unix distribution. In bioinformatics, not only are sequence data very easy to manipulate with Perl, but most bioinformatics algorithms also output some kind of text results.

Then, one of the biggest bioinformatics centers, the EBI, had that great guy, Ewan Birney, who was leading the BioPerl project. That library has lots of parsers for the results of popular bioinformatics algorithms, and for manipulating the different sequence formats used in major sequence databases.

Nowadays, however, Perl is not the only language used by bioinformaticians: along with sequence data, labs produce more and more different kinds of data types and other languages are more often used in those areas.

The R statistics programming language, for example, is widely used for statistical analysis of microarray and qPCR data (among others). Again, why are we using it so much? Because it has great libraries for that kind of data (see the Bioconductor project).

Now when it comes to web development, CGI is not really state of the art today, but people who know Perl may stick to it. In my company though it is no longer used...

I hope this helps.

Peter Mortensen
Philippe
3

Perl basically forces very short development cycles. That's the kind of development that gets stuff done.

It's enough to outweigh Perl's disadvantages.

Peter Mortensen
Andomar
  • 1
    How does Perl force short dev cycles? – Kevin Mar 26 '10 at 22:54
  • 2
    I think he means, "allows for". How does it allow for short development cycles? Libraries and minimal boilerplate. The code you write solves your problem; it doesn't reinvent the wheel or exist to placate the compiler (hello, java). – jrockway Mar 26 '10 at 23:17
  • Well, "allows" generally means that you make an edit and immediately run the result. There's no a priori need to compile, link, etc. – brian d foy Mar 27 '10 at 00:19
2

Bioinformatics deals primarily in text parsing, and Perl is the best programming language for the job as it is made for string parsing. As the O'Reilly book (Beginning Perl for Bioinformatics) says, "With [Perl]'s highly developed capacity to detect patterns in data, Perl has become one of the most popular languages for biological data analysis."

Peter Mortensen
Kyra
1

People have left out DBI, the Perl abstract database interface, which makes it really easy to work with bioinformatics databases.
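For readers who haven't used DBI, the idea of one generic interface over a database looks roughly like the sketch below. This is Python's standard `sqlite3` module standing in as an analogue, not DBI itself, and the `gene` table, its columns, and the function names are all invented for illustration.

```python
import sqlite3


def load_genes(connection, genes):
    """Create a small gene table (hypothetical schema) and insert (name, chromosome, length) rows."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS gene "
        "(name TEXT PRIMARY KEY, chromosome TEXT, length INTEGER)"
    )
    # Placeholders (?) keep the data separate from the SQL text.
    connection.executemany(
        "INSERT INTO gene (name, chromosome, length) VALUES (?, ?, ?)", genes
    )
    connection.commit()


def genes_on_chromosome(connection, chromosome):
    """Return the names of all genes on one chromosome, sorted alphabetically."""
    rows = connection.execute(
        "SELECT name FROM gene WHERE chromosome = ? ORDER BY name", (chromosome,)
    )
    return [name for (name,) in rows]
```

With an in-memory database from `sqlite3.connect(":memory:")`, loading a few made-up rows and calling `genes_on_chromosome(connection, "1")` returns just the names stored for chromosome 1.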

There is also the one-liner angle. You can write something to reformat data in a single line in Perl and just use the -pe flag to embed that at the command line. Many people using AWK and sed moved to Perl. Even in full programs, file I/O is incredibly easy and quick to write, and text transformation is expressive at a high level compared to any engineering language around. People who use Java or even Python for one-off text transformation are just too lazy to learn another language. Java especially has a high dependence on the JVM implementation and its I/O performance.
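To show what the one-liner style buys you: `-e` supplies the code on the command line, and `-p` wraps it in a loop that reads each input line, runs the code, and prints the result, so something like `perl -pe 's/\t/,/g'` converts tabs to commas. The Python sketch below spells out that implicit loop; the function names are hypothetical, and it is only meant to make the mechanism visible, not to replace the one-liner.

```python
import re


def filter_lines(lines, transform):
    """Apply transform to each input line and yield the result (the implicit read-and-print loop)."""
    for line in lines:
        yield transform(line)


def tabs_to_commas(line):
    """The per-line edit: replace every tab with a comma."""
    return re.sub(r"\t", ",", line)
```

Used as a command-line filter, this would be `sys.stdout.writelines(filter_lines(sys.stdin, tabs_to_commas))` after `import sys`, which is a fair amount of ceremony next to the single-line version.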

At least you know how fast or slow Perl will be everywhere, slightly slower than C I/O. Don't learn grep, cut, sed, or AWK; just learn Perl as your command line tool, even if you don't produce large programs with it. Regarding CGI, Perl has plenty of better web frameworks such as Catalyst and Mojolicious, but the mindshare definitely came from CGI and bioinformatics being one of the earliest heavy users of the Internet.

Peter Mortensen
Matt
1

This seems to be a pretty comprehensive response. Perhaps one thing missing, however, is that most biologists (until recently, perhaps) don't have much programming experience at all. The learning curve for Perl is much gentler than for compiled languages (like C or Java), and yet Perl still provides a ton of features when it comes to text processing. So what if it takes longer to run? Biologists can definitely handle that. Lab experiments routinely take an hour or more to finish, so waiting a few extra minutes for the data processing to finish isn't going to kill them!

Just note that I am talking here about biologists that program out of necessity. I understand that there are some very skilled programmers and computer scientists out there that use Perl as well, and these comments may not apply to them.

Peter Mortensen
Daniel Standage
0

Perl is very easy to learn compared to other languages. It can fully exploit biological data, which is becoming big data. It can manipulate big data and performs well for data manipulation, data curation, and all types of DNA programming. Automation of biology has become easy thanks to languages like Perl, Python, and Ruby. Perl is very easy for those who know biology but do not know how to program in other languages.

Peter Mortensen
0

Personally, and I know this will date me, it's because I learned Perl first. I was being asked to take FASTA files and mix them with other FASTA files. Perl was the recommended tool when I asked around.

At the time I'd been through a few computer science classes, but I didn't really know programming all that well.

Perl proved fairly easy to learn. Once I'd gotten regular expressions into my head I was parsing and making new FASTA files within a day.
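For anyone who hasn't seen the format, the kind of FASTA parsing and rewriting described here can be sketched as follows. This is Python rather than the Perl I used back then, the function names are made up, and `merge_fasta` is only one guess at what "mixing" files might mean (keep the first record seen for each header).

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence data may be wrapped over several lines
    if header is not None:
        yield header, "".join(chunks)


def merge_fasta(*sources):
    """Combine several FASTA sources, keeping the first record seen for each header."""
    seen = {}
    for source in sources:
        for header, sequence in parse_fasta(source):
            seen.setdefault(header, sequence)
    return list(seen.items())


def format_fasta(records, width=60):
    """Render (header, sequence) pairs as FASTA text, wrapping sequences at width."""
    out = []
    for header, sequence in records:
        out.append(f">{header}")
        out.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "".join(line + "\n" for line in out)
```

Because `parse_fasta` accepts any iterable of lines, an open file handle works directly, for example `merge_fasta(open("a.fa"), open("b.fa"))` with hypothetical file names.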

As has been suggested, I was not a programmer. I was a biochemistry graduate working in a lab, and I'd made the mistake of setting up a Linux server where everyone could see me. This was back in the day when that was an all-day project.

Anyway, Perl became my go-to for anything I needed to do around the lab. It was awesome, easy to use, and super flexible, and the other Perl guys in other labs were a lot like me.

So, to cut it short, Perl is easy to learn, flexible and forgiving, and it did what I needed.

Once I really got into bioinformatics I picked up R, Python, and even Java. Perl is not that great at helping to create maintainable code, mostly because it is so flexible. Now I just use the language for the job, but Perl is still one of my favorite languages, like a first kiss or something.

To reiterate, most bioinformatics folks learned coding by just kluging stuff together, and most of the time you're just trying to get an answer for the principal investigator (PI), so you can't spend days on code design. Perl is superb at just getting an answer. It probably won't work a second time, and you will not understand anything in your own code if you see it six months later, but if you need something now, it is a good choice, even though I mostly use Python now.

I hope that gives you an answer from someone who lived it.

Peter Mortensen