9

Why is it a bad idea to commit Java JAR files into a repository (CVS, SVN, ...)?

Neel
  • Could you clarify if you are talking about third-party jars or jars that are generated from your own source code? – KevinS Jan 10 '11 at 16:52
  • Both. jar files which are generated from sources owned by us and third-party/open-source jar files. – Neel Jan 10 '11 at 16:55
  • 1
    This can be debated forever, my preference is to include the jars and NOT use a dependency engine as they just introduce another layer of complexity for an incredibly simple problem to manage. – Randyaa Jan 10 '11 at 17:17
  • 3
    Whatever you do, make sure when someone checks out the project they can just run your build script and they'll have everything they need, whether you use a dependency engine like Ivy or Maven or just manage the libraries yourself. – Randyaa Jan 10 '11 at 17:18

5 Answers

8

Because you can rebuild them from the source. On the other hand, if you are talking about third-party JAR files that your project requires, then it is a good idea to commit them into the repository so that the project is self-contained.

Darin Dimitrov
  • 7
    Well, for dependencies, the solution is not in SCM, but rather in using a dependency management tool (like Ivy or Maven), in order to have their definition in SCM, but the effective JARs elsewhere. – Riduidel Jan 10 '11 at 16:35
  • @Riduidel - this should be an answer – Anon Jan 10 '11 at 16:41
  • @Riduidel - Could you describe why you think it is a good idea to store jars elsewhere? The poster has probably seen comments like yours before, which prompted him to ask his question, and it's one that's been puzzling me too. – KevinS Jan 10 '11 at 17:54
  • @Anon @Kevin_Stembridge I turned this comment into an answer; thanks for your comments. – Riduidel Jan 11 '11 at 10:00
7

So, you have a project that uses some external dependencies. These dependencies are well known. They all have:

  • A group (typically, the organization/forge creating them)
  • An identifier (their name)
  • A version

In Maven terminology, these pieces of information are called the coordinates of the artifact (your JAR).
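
To make the idea of coordinates concrete, here is a minimal sketch in Java. The `Coordinates` class and its method names are hypothetical and exist only for illustration; Maven itself calls the three parts groupId, artifactId and version. The one fact taken from the tooling is the standard Maven repository layout, in which the dots of the group become directory separators.

```java
/** Hypothetical value type for the three coordinates listed above. */
final class Coordinates {
    final String group;
    final String name;
    final String version;

    Coordinates(String group, String name, String version) {
        this.group = group;
        this.name = name;
        this.version = version;
    }

    /** Parses the common "group:name:version" shorthand. */
    static Coordinates parse(String text) {
        String[] parts = text.split(":");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected group:name:version but got: " + text);
        }
        return new Coordinates(parts[0], parts[1], parts[2]);
    }

    /** Relative path of the JAR in a repository that follows the standard Maven layout. */
    String jarPath() {
        return group.replace('.', '/') + "/" + name + "/" + version
                + "/" + name + "-" + version + ".jar";
    }

    public static void main(String[] args) {
        // Prints the repository-relative path for one well-known artifact.
        System.out.println(Coordinates.parse("org.apache.commons:commons-lang3:3.12.0").jarPath());
    }
}
```

Because the path is fully determined by the coordinates, a small text declaration is enough for a tool to find the binary; the binary itself never has to be versioned.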

The dependencies I am talking about are either internal (for a web application, it can be your service/domain layer) or external (log4j, a JDBC driver, a Java EE framework, you name it). All those dependencies (also called artifacts) are, at their lowest level, binary files (JAR/WAR/EAR) that your CVS/SVN/Git won't be able to store efficiently. Indeed, an SCM assumes that versioned content (the kind for which diff operations are most efficient) is text only. As a consequence, when binary data is stored, there is rarely any storage optimization (contrary to text, where only the differences between versions are stored).

As a consequence, I would recommend that you use a build system with dependency management, like Maven, Ivy, or Gradle. Using such a tool, you declare all your dependencies (more precisely, the coordinates of your dependencies' artifacts) in a text (or maybe XML) file, which lives in your SCM. BUT your dependencies themselves won't be in SCM. Rather, each developer will download them to their dev machine.

This transfers some network load from the SCM server to the internet (whose bandwidth is often more limited than that of an internal enterprise network), and raises the question of the long-term availability of artifacts. Both of these issues are solved (at least in the Maven world, but I believe both Ivy and Gradle are able to connect to such tools, and it seems some questions have been asked on this very subject) by using enterprise proxies, like Nexus, Artifactory and others.

The beauty of these tools is that they make a view of all required artifacts available on the internal network, going as far as allowing you to deploy your own artifacts to these repositories, which makes sharing your code both easy and independent from the source (which may be an advantage).
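
As a rough illustration of why a proxy helps, the toy model below looks an artifact up in a list of repositories, nearest first. Everything in it is an assumption made for the example: the `LookupChain` class, the repository names and their contents are invented, artifacts are plain "group:name:version" strings, and a real proxy would also fetch and cache missing artifacts from upstream rather than just answer yes or no.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Toy model of artifact lookup; repository names and contents are made up. */
class LookupChain {

    /** Repositories in the order they are consulted (insertion order is kept). */
    private final Map<String, Set<String>> repositories = new LinkedHashMap<>();

    LookupChain add(String repositoryName, Set<String> artifacts) {
        repositories.put(repositoryName, artifacts);
        return this;
    }

    /** Returns the name of the first repository that holds the artifact, if any. */
    Optional<String> locate(String artifact) {
        return repositories.entrySet().stream()
                .filter(entry -> entry.getValue().contains(artifact))
                .map(Map.Entry::getKey)
                .findFirst();
    }

    public static void main(String[] args) {
        LookupChain chain = new LookupChain()
                .add("local cache", Set.of("log4j:log4j:1.2.17"))
                .add("enterprise proxy", Set.of("log4j:log4j:1.2.17", "com.example:domain-layer:1.0"))
                .add("public repository", Set.of("log4j:log4j:1.2.17", "junit:junit:4.13.2"));

        // An in-house artifact is only ever served from the internal network.
        System.out.println(chain.locate("com.example:domain-layer:1.0"));
        // A third-party artifact already downloaded once never leaves the machine.
        System.out.println(chain.locate("log4j:log4j:1.2.17"));
    }
}
```

The ordering is the point of the sketch: the slower public network is only reached when nothing closer has the artifact.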

To sum up this long reply: use Ivy/Maven/Gradle instead of a simple Ant build. These tools will allow you to define your dependencies, and do all the work of downloading these dependencies and ensuring you use the declared version.

On a personal note, the day I discovered those tools, my view of dependency handling in Java went from nightmare to heaven, as I now only have to say that I use this very version of this tool, and Maven (in my case) does all the background job of downloading it and storing it at the right location on my computer.

Arjan Tijms
Riduidel
  • CVS does not store binary files efficiently. However, SVN (and I would guess Git, Mercurial, etc.) stores everything in an efficient binary format, even text files. – KevinS Jan 11 '11 at 11:04
  • By default, Mercurial doesn't store binary files in an efficient manner. It stores the bytes and if one byte changes in the file, a whole copy of that file is stored again. See the "largefiles" extension to work with binary files (but it comes with tradeoffs) – David I. Aug 02 '12 at 21:53
  • Also, Maven allows you to link projects within Eclipse/IntelliJ when you are working on a library, without mucking about with the classpath to point to a project instead of the library. And of course, it manages all the transitive dependencies and takes care of potentially overlapping .jar files (after all, it's a "dependency manager"). Versioning is simpler; overall, checking .jars into source control is just painful. – cgp Dec 18 '14 at 16:08
4

Source control systems are designed for holding text source code. They can hold binary files, but that isn't really what they are designed for. In some cases it makes sense to put a binary file in source control, but Java dependencies are generally better managed in a different way.

The ideal setup lets you manage your dependencies outside of source control and simply "point" to the desired dependency from within the source. This has several advantages:

  • You can have a number of projects dependent on the same binaries without keeping a separate copy of each binary. It is common for a medium sized project to have hundreds of binaries it depends on. This can result in a great deal of duplication which wastes local and backup resources.
  • Versions of binaries can be managed centrally within your local environment or within the corporate entity.
  • In many situations, the source control server is not a local resource. Adding a bunch of binary files will slow things down because it increases the amount of data that needs to be sent across a slower connection.
  • If you are creating a war, there may be some jars you need for development, but not deployment and vice versa. A good dependency management tool lets you handle these types of issues easily and efficiently.
  • If you are depending on a binary file that comes from another one of your projects, it may change frequently. This means you could be constantly overwriting the binary with a new version. Since version control is going to keep every copy, it could quickly grow to an unmanageable size--particularly if you have any type of continuous integration or automated build scripts creating these binaries.
  • A dependency management system offers a certain level of flexibility in how you depend on binaries. For example, on your local machine, you may want to depend on the latest version of a dependency as it sits on your file system. However, when you deploy your application you want the dependency packaged as a jar and included in your file.

Maven's dependency management features solve these issues for you and can help you locate and retrieve binary dependencies as needed. Ivy is another tool that does this as well, but for Ant.
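
The point about jars needed for development but not deployment can be sketched in Java as follows. The scope names mirror Maven's four most common dependency scopes, and the rule that only compile- and runtime-scoped jars are packaged into a war is standard Maven behaviour. The `WarPackaging` class, its method and the particular versions listed are assumptions chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Illustrative filter; the class and method names are hypothetical. */
class WarPackaging {

    /** Named after Maven's four most common dependency scopes. */
    enum Scope { COMPILE, PROVIDED, RUNTIME, TEST }

    /** Keeps only the dependencies that would be packaged inside the war. */
    static List<String> packagedJars(Map<String, Scope> declared) {
        return declared.entrySet().stream()
                .filter(e -> e.getValue() == Scope.COMPILE || e.getValue() == Scope.RUNTIME)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Scope> declared = new LinkedHashMap<>();
        declared.put("log4j:log4j:1.2.17", Scope.COMPILE);
        // Supplied by the servlet container, so needed to compile but not shipped.
        declared.put("javax.servlet:javax.servlet-api:3.1.0", Scope.PROVIDED);
        declared.put("mysql:mysql-connector-java:8.0.33", Scope.RUNTIME);
        // Only needed while running the tests.
        declared.put("junit:junit:4.13.2", Scope.TEST);

        System.out.println(packagedJars(declared));
    }
}
```

With jars committed to source control, the same distinction has to be maintained by hand, typically through separate directories and build-script rules.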

Mark
  • Hi Mark, your first two sentences are true with regards to CVS but not for SVN (and I would guess most modern SCMs). http://svnbook.red-bean.com/en/1.5/svn.forcvs.binary-and-trans.html – KevinS Jan 21 '11 at 16:54
  • Kevin, I realize that most SCMs can hold binary information. I'm just saying that they were mainly built for storing text. Many of the tools you'll use with an SCM are only meaningful when dealing with text files. Also if you are storing large .jar files in your SCM and they change (in filename and contents) as you upgrade to different versions, your repository can become quite bloated with all the different versions of the binary files. In some cases this might not matter, but in others it can slow down your operations and make backups more of a problem. – Mark Jan 22 '11 at 05:01
  • Hi Mark, with the exception of the ancient CVS, it's not true that SCMs are built for storing text. As far as storage is concerned, all files are binary and they use an efficient binary differencing algorithm. – KevinS Jan 31 '11 at 12:05
  • Take all the code that goes into an SCM and divide it into three sections: 1. all the code that is primarily for working with text. 2. all the code that is primarily for working with binary files 3. all the code that is generic and used for both types of files. I still maintain that the quantity of code in 1 and 3 will be greater than the code in 2 and 3. This is because some of the most complicated parts of any SCM system deal with combining changes--something you don't do with binary files. So while they will work fine with binary, that isn't their primary purpose or design. – Mark Jan 31 '11 at 19:12
  • I'd add one more point - properly configured dependency management systems are able to keep track of transitive dependencies, if you do not include them, you usually get ClassNotFound at runtime, and often only if you use some corner functionality of your product that is not part of your automated tests. – Pavel Oct 24 '12 at 13:49
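
The transitive-dependency tracking mentioned in the last comment amounts to a graph traversal. Below is a minimal sketch under stated assumptions: the `TransitiveClosure` class and the artifact names are hypothetical, each artifact simply lists its direct dependencies, and real tools additionally handle version conflicts, scopes and exclusions, none of which is modelled here.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Breadth-first walk over declared dependencies; names are illustrative only. */
class TransitiveClosure {

    /** Returns every artifact the root needs, directly or indirectly, excluding the root. */
    static Set<String> resolve(String root, Map<String, List<String>> directDependencies) {
        Set<String> seen = new LinkedHashSet<>();
        seen.add(root);
        Deque<String> pending = new ArrayDeque<>();
        pending.add(root);
        while (!pending.isEmpty()) {
            String artifact = pending.removeFirst();
            for (String dependency : directDependencies.getOrDefault(artifact, List.of())) {
                // Set.add returns false for artifacts already visited, which also stops cycles.
                if (seen.add(dependency)) {
                    pending.addLast(dependency);
                }
            }
        }
        seen.remove(root);
        return seen;
    }

    public static void main(String[] args) {
        Map<String, List<String>> declared = Map.of(
                "app", List.of("web-framework"),
                "web-framework", List.of("logging", "xml-parser"));
        // "app" declares one dependency but needs three jars on its classpath.
        System.out.println(resolve("app", declared));
    }
}
```

When jars are copied into the repository by hand, this walk has to be done by the person copying them, and a missed entry only shows up as a class-loading failure at runtime.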
3

They are binary files:

  • It's better to reference the source, since that's what you're using source control for.
  • The system can't tell you what the differences between two versions of the file are.
  • They become a source of merge-conflicts, in case they are compiled from the source in the same repository.
  • Some systems (e.g. SVN) don't deal very well with large binary files.

In other words, better reference the source, and adjust your build scripts to make everything work.

vdboor
  • Are you sure SVN doesn't handle binary files well? From the SVN docs it treats binary and text files identically. http://svnbook.red-bean.com/en/1.5/svn.forcvs.binary-and-trans.html Also, have a look at this community wiki page: http://stackoverflow.com/questions/538643/how-good-is-subversion-at-storing-lots-of-binary-files – KevinS Jan 10 '11 at 16:50
2

The decision to commit jar files to SCM is usually influenced by the build tool being used. If using Maven in a conventional manner then you don't really have the choice. But if your build system allows you the choice, I think it is a good idea to commit your dependencies to SCM alongside the source code that depends on them.

This applies to third-party jars and in-house jars that are on a separate release cycle to your project. For example, if you have an in-house jar file containing common utility classes, I would commit that to SCM under each project that uses it.

If using CVS, be aware that it does not handle binary files efficiently. An SVN repository makes no distinction between binary and text files.

http://svnbook.red-bean.com/en/1.5/svn.forcvs.binary-and-trans.html

Update in response to the answer posted by Mark:

WRT bullet point 1: I would say it is not very common for even a large project to have hundreds of dependencies. In any case, disk usage (by keeping a separate copy of a dependency in each project that uses it) should not be your major concern. Disk space is cheap compared with the amount of time lost dealing with the complexities of a Maven repository. In any case, a local Maven repository will consume far more disk space than just the dependencies you actually use.

Bullet 3: Maven will not save you time waiting for network traffic. The opposite is true. With your dependencies in source control, you do a checkout, then you switch from one branch to another. You will very rarely need to check out the same jars again. If you do, it will take only minutes. The main reason Maven is a slow build tool is all the network access it does even when there is no need.

Bullet Point 4: Your point here is not an argument against storing jars in SCM. Maven is only easy once you have learned it, and it is only efficient up to the point when something goes wrong. Then it becomes difficult and your efficiency gains can disappear quickly. In terms of efficiency, Maven has a small upside when things work correctly and a big downside when they don't.

Bullet Point 5: Version control systems like SVN do not keep a separate copy of every version of every file. They store them efficiently as deltas. It is very unlikely that your SVN repository will grow to an 'unmanageable' size.

Bullet Point 6: Your point here is not an argument against storing files in SCM. The use case you mention can be handled just as easily by a custom Ant build.

KevinS