
With Scala, using sbt for builds and git for version control, what would be a good way of organizing your team's code when it outgrows being a single project? At some point, you start thinking about splitting your code into separate libraries or projects and importing between them as necessary. How would you organize things for that? Or would you avoid the temptation and just manage all packages under a single sbt and git "project"?

Points of interest being: (feel free to change)

  • Avoiding inventing new "headaches" that over-engineer imaginary needs.
  • Still being able to easily build everything when you still want to, on a given dev machine or a CI server.
  • Packaging for production: being able to use SbtNativePackager to package your stuff for production without too much pain.
  • Easily control which version of each library you use on a given dev machine, and being able to switch between them seamlessly.
  • Keeping git workflows from becoming more cumbersome than they already typically are.

In addition, would you use some sort of local sbt/maven team repository, and what would be needed to set that up? Hopefully this is not necessary, though.

Thanks!

matanster
  • It depends heavily on the nature of the project and to what extent it demands, or you need, modularity. Anyway, one option is to keep them as separate modules in a [multi-project](http://www.scala-sbt.org/0.13.5/docs/Getting-Started/Multi-Project.html) configuration. That way you can aggregate them in the parent project, which is especially helpful when the team is in the early stages of development. This keeps the option of separating them easily later, while still letting you build everything and run all tests with a single `sbt` command. – Nader Ghanbari Oct 21 '14 at 12:50
  • This is great, but I'm not sure I follow about the classpath dependencies as defined there. Do they mean that one project will automagically get the classpath of its `classpath dependency`, or does it also mean that compiling one will _always_ compile the other? – matanster Nov 05 '14 at 14:00
  • By classpath dependency they mean inter-module dependencies, which are pretty flexible, in the sense that you can depend `test` on `test` and `compile` on `compile`, or even `compile` on `test`, which is very useful. So, briefly, it means that when project `A` depends on `B` via `.dependsOn(B)`, by default you can use all classes in project `B` in project `A`. – Nader Ghanbari Nov 05 '14 at 15:47
  • But aggregation is a different thing, it means that when project `A` aggregates projects `B` and `C` (independent of being dependent on them or not) when you build `A`, `B` and `C` will be built automatically. This can be very useful as well when you want to test, or compile all of them together. – Nader Ghanbari Nov 05 '14 at 15:49
  • So then I take it that classpath dependencies also take care of automatically making the dependency projects' outputs available on the classpath... I guess using git submodules or git subtrees per project of a `multi-project` you get good flexibility around versioning it all... – matanster Nov 05 '14 at 20:35
  • Yes, they do. And yes, that's a good way to manage sub-projects. – Nader Ghanbari Nov 06 '14 at 04:36
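As a sketch of what these comments describe, a minimal `build.sbt` showing the difference between `dependsOn` (classpath dependency) and `aggregate` might look like this (module names are hypothetical):

```scala
// build.sbt — minimal multi-project sketch (hypothetical module names)

// Shared code; depends on nothing else in the build.
lazy val core = project.in(file("core"))

// `util` can use core's main classes, and core's test helpers in its own tests.
lazy val util = project.in(file("util"))
  .dependsOn(core % "compile->compile;test->test")

// The root project aggregates the modules: running `sbt test` at the root
// compiles and tests core and util as well, without making root depend on them.
lazy val root = project.in(file("."))
  .aggregate(core, util)
```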

2 Answers


I use the following lines in the sand:

  • Code which ultimately goes into different deployables goes in different folders in the same repository, under an umbrella project - what SBT calls a multi-project build (I use maven rather than SBT, but the concepts are very similar). Each will be built and deployed as a different jar.

I try to consider the final deployables when making divisions that make sense. For example, if my system foosys has foosys-frontend and foosys-backend deployables, where foosys-frontend does HTML templating and foosys-backend talks to the database and the two communicate via a REST API, then I'll have those as separate projects, and a foosys-core project for common code. foosys-core isn't allowed to depend on the HTML templating library (because foosys-backend doesn't want that), nor on the ORM library (because foosys-frontend doesn't want that). But I don't worry about separating the code that works with the REST library from the "core domain objects", because both foosys-frontend and foosys-backend use the REST code.

Now suppose I add a new foosys-reports deployable, which accesses the database to do some reports. Then I'll probably create a foosys-database project, depending on foosys-core, to hold shared code used by both foosys-backend and foosys-reports. And since foosys-reports doesn't use the REST library, I should probably also split out foosys-rest from foosys-core. So I end up with a foosys-core library, two more library projects that depend on it (foosys-database and foosys-rest), and the three deployable projects (foosys-reports depending on foosys-database, foosys-frontend depending on foosys-rest, and foosys-backend depending on both).

You'll notice that this means there's one code project for every combination of deployables where that code might be used. Code that goes in all three deployables goes in foosys-core. Code that goes in just one deployable goes in that deployable's project. Code that goes in two of the three deployables goes in foosys-rest or foosys-database. If we wanted to have some code that was part of the foosys-frontend and foosys-reports deployables, but not the foosys-backend deployable, we'd have to create another project for that code. In theory this means an exponential blowup in the number of projects as we add more deployables. In practice I've found it's not too problematic - most theoretically possible combinations don't actually make sense, so as long as we only create new projects when we actually have code to put in them it's ok. And if we end up with a couple of classes in foosys-core that aren't actually used in every single deployable, it's not the end of the world.
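The resulting dependency graph can be sketched in sbt terms (the answer itself uses maven, so this build.sbt is an illustrative translation using the foosys names from the example, not the author's actual build):

```scala
// build.sbt — sketch of the foosys layout described above

// Library projects, one per combination of deployables that shares code
lazy val core     = project.in(file("foosys-core"))
lazy val rest     = project.in(file("foosys-rest")).dependsOn(core)
lazy val database = project.in(file("foosys-database")).dependsOn(core)

// Deployable projects: each pulls in only the shared code it needs
lazy val frontend = project.in(file("foosys-frontend")).dependsOn(rest)
lazy val backend  = project.in(file("foosys-backend")).dependsOn(rest, database)
lazy val reports  = project.in(file("foosys-reports")).dependsOn(database)
```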

Tests are best understood in this view as another kind of deployable. So I would have a separate foosys-test project containing common code that was used for tests for all three deployable projects (depending on foosys-core), and perhaps a foosys-database-test project (depending on foosys-test and foosys-database) for test helper code (e.g. database integration test setup code) that was common between foosys-backend and foosys-reports. Ultimately we might end up with a full parallel hierarchy of -test projects.

  • Only move projects into separate git repositories (and, at the same time, separate overall builds) once they have different release lifecycles.

Code in different repositories is necessarily versioned independently, so in some sense this is a vacuous definition. But I think you should move to separate git repositories only when you have to (analogy with this post: you should only use Hadoop when your data is too big to use anything friendlier).

Once your code is in multiple git repositories, you have to manually update the dependencies between them (on a dev machine you can use -SNAPSHOT dependencies and IDE support to work as though the versions were still in sync, but you have to manually update this every time you resync with master, so it adds friction to development). Since you're doing releases and updating the dependency asynchronously, you have to adopt and enforce something like semantic versioning, so that people know when it's safe to update the dependency on foocorp-utils and when it isn't. You have to publish changelogs, and have an early-warning CI build, and a more thorough code review process.

All this is because the feedback cycle is a lot longer: if you break something in a downstream project, you won't know about it until they update their dependency on foocorp-utils, months or even years later (yes, years - I have witnessed this, and in an 80-person startup, not a megacorp). So you need process to prevent that, and everything becomes correspondingly less agile.
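In sbt terms, the cross-repository dependency that has to be bumped by hand looks something like this (organization, artifact, and version numbers are illustrative):

```scala
// In the downstream project's build.sbt: an ordinary binary dependency on a
// library released from another repository. This version string must be
// updated manually on every upstream release.
libraryDependencies += "com.foocorp" %% "foocorp-utils" % "2.3.1"

// During active cross-repo development you can point at an unreleased build
// instead, and re-pin to a released version before your own release:
// libraryDependencies += "com.foocorp" %% "foocorp-utils" % "2.4.0-SNAPSHOT"
```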

Valid reasons to do this include:

  • A full build of your project is taking too long, slowing down integration on the code you're working on - though try to speed it up first.
  • Deploying all your deployables is taking too long - though again, try to automate this and speed it up. There's a real advantage from keeping everything in sync, you don't want to give it up until you absolutely have to.
  • Separate teams need to work on the code. If you're not in constant communication with each other then you'll need the process overhead (semantic versioning etc.) anyway, so you may as well get the faster build times. (To be clear, I think every git repository should have a single team that owns and is responsible for it, and when teams split they should split repositories. I have further thoughts on release processes and responsibilities, but this answer is already pretty long).

I would use a team maven repository, probably Nexus. Actually I'd recommend this even before you get to the multi-project stage. It's very easy to run (just a Java app), and you can proxy your external dependencies through it, meaning you have a reliable source for your dependency jars and your builds will be reproducible even if one of your upstream dependencies disappears.
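Wiring sbt to a team repository like Nexus typically amounts to a resolver plus publish settings, roughly like this (the URLs and file locations are placeholders for your own instance):

```scala
// build.sbt — pointing sbt at a team Nexus (placeholder URLs)

// Resolve dependencies (and proxied external artifacts) through the team repo
resolvers += "Team Nexus" at "https://nexus.example.com/repository/maven-public/"

// Publish snapshots and releases to their respective hosted repositories
publishTo := {
  val nexus = "https://nexus.example.com/repository/"
  if (isSnapshot.value) Some("snapshots" at nexus + "maven-snapshots/")
  else Some("releases" at nexus + "maven-releases/")
}

// Credentials are usually kept out of the build file itself
credentials += Credentials(Path.userHome / ".sbt" / ".credentials")
```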

I intend to write up my ways of team working as a blog post, but in the meantime I'm happy to answer any further questions.

lmm
  • Thanks @lmm, whereas my scenario may have a different mix of nuances, this deliberation is very helpful! Also, a link to the blog post, once written, would be nice here in the future. Nexus looks cool indeed - good to know. I like its advertised proxying feature, which seems to eliminate the fragile temporal dependency on external resources. I do wonder, though, at what point the free version of it no longer suffices and you need to make the leap... – matanster Nov 05 '14 at 19:58
  • Should I take it that using Nexus, you do not employ git submodules nor [git subtree](http://blogs.atlassian.com/2013/05/alternatives-to-git-submodule-git-subtree/) then, as each project just fetches its dependencies from Nexus? doesn't that go awry at those times when you want to change several repos more or less concurrently, say when doing experimental development... I mean the poor dev would then need to re-stitch the entire integration between the repos, while commenting out the Nexus dependencies - which can be very time consuming a context switch and also a small nightmare :(... – matanster Nov 05 '14 at 20:22
  • @lmm you mention a lot of architecting around deployables; what sizes of deployables are we talking about in that scenario? (My jar files aren't that large, at least before packaging them along with all their dependencies for production, and I have a micro-service architecture.) – matanster Nov 05 '14 at 20:25
  • I've never had a problem using the free version of Nexus, with repositories on the order of a terabyte; I think the paid version adds additional features rather than it being a question of size or anything like that. – lmm Nov 05 '14 at 23:23
  • Yes, I avoid git submodules and subtree. Since I try to keep things together in a single git repository as far as possible, only splitting out when a project is logically separate or worked on by a different team, it's rare to want to make changes across several different git repositories (across maven modules that are versioned and released together is fine) - normally one team would make and test their own change and go through a release cycle, and only then would another team update their dependency. And remember that not every maven release has to correspond to a full deployment. – lmm Nov 05 '14 at 23:28
  • Note that for modules in the same git repository I have them all inherit (including their versions) from a common parent, and use ${project.version} when depending on another project in the same repository, so during development all the projects depend on the development versions of each other and changes will be reflected instantly (in eclipse) or the next time you build the whole repository (on the command line). Releases happen together, using the maven release plugin, so there's a single tag and common version for any release. – lmm Nov 05 '14 at 23:30
  • That said, when you do need to develop with projects from different repositories, it's pretty simple to link them up - just change the version of the dependency to the relevant -SNAPSHOT, and in eclipse the projects just depend on each other and changes in one are instantly reflected in the other. On the command line you have to build the dependency before the dependent project, which is a hassle but fair enough really. The maven release plugin prevents you from doing a release with a -SNAPSHOT dependency, so it enforces that you release the dependency first. – lmm Nov 05 '14 at 23:32
  • I don't think the size of the deployables makes a lot of difference. I've used this kind of structure on a ~500kloc monolithic-ish project with about 6 deployables, and I've used it on a ~20kloc microservicey project that had dozens of deployables. Some people get concerned about having a large number of maven modules for a relatively small amount of actual code, but I've yet to see it cause any practical problems. – lmm Nov 05 '14 at 23:35

I'm a little late here, but my 2 cents.

Most Scala projects, and indeed most projects I've worked on in past jobs, have ended up with a very similar structure, usually arrived at by consensus with other team members (which helps validate the decision). The main philosophical difference has been whether to separate projects by technical infrastructure layer or by business module. Examples below:

Common Projects

  • App.Utils : Shared utility code used by all other projects (minimal to zero dependencies)
  • App.Core : Shared business code (models, core helpers, interfaces, types)

Option 1: Module separation

  • App.Inventory: The inventory module with services, database code, helpers
  • App.Orders : The order management module with services, database, helpers

This can be very convenient and easy to manage by business area, and you can then deploy single modules as needed. You can also later decide to separate the modules out into standalone APIs if needed (with a shared code base still in utils and core). The disadvantage here is that this approach can make the number of projects swell.
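The module-separation layout above might be sketched in sbt like this (the App.* names follow the answer; directory names are assumptions):

```scala
// build.sbt — sketch of the module-separation layout (Option 1)
lazy val utils = project.in(file("app-utils"))
lazy val core  = project.in(file("app-core")).dependsOn(utils)

// Each business module depends only on the shared projects,
// never on a sibling module, so it can be deployed on its own.
lazy val inventory = project.in(file("app-inventory")).dependsOn(core)
lazy val orders    = project.in(file("app-orders")).dependsOn(core)
```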

Option 2: Tech layer separation

  • App.Database: Database access functions
  • App.Services : Core implementations of business services

In this approach all the logic / services for all areas live in the services project, and likewise for the database. So the code for, say, inventory is split between the database and services projects. This allows separating by traditional technical tiers, and can be much faster for smaller projects.

Personally, I prefer the more modular separation in option 1. It's more scalable and generally feels simpler when making code changes.

-K