
I'm running Mesos and Ceph clusters on CoreOS with a working Ceph RBD Docker volume plugin, but it's very unclear to me how this can be used with Mesos/Marathon... Creating/using rbd volumes for single Docker containers is working flawlessly though.

I can find no article/blog post/whatever that deals with the automated creation (and, in case of "task migration" between Mesos slaves, remapping) of these volumes via Marathon. Especially important to me is how to run multiple instances of a stateful service when each instance needs to have its own volume (imagine a MongoDB ReplicaSet on Mesos/Marathon).

I know the Mesos persistent volume docs, and I also saw the Marathon issue, but I'm still confused about how or when this will be usable...

There are also other questions here on SO, which unfortunately don't really have an answer to this specific problem.

The EMC Code example with RexRay also covers only a single-instance case, which I could already handle easily with the volume plugin mentioned above:

{
    "id": "nginx",
    "container": {
        "docker": {
            "image": "million12/nginx",
            "network": "BRIDGE",
            "portMappings": [{
                "containerPort": 80,
                "hostPort": 0,
                "protocol": "tcp"
            }],
            "parameters": [{
                "key": "volume-driver",
                "value": "rbd"
            }, {
                "key": "volume",
                "value": "nginx-data:/data/www"
            }]
        }
    },
    "cpus": 0.2,
    "mem": 32.0,
    "instances": 1
}

The nginx-data volume would be created automatically in this case. But what if I want to use persistent volumes and multiple instances?

Tobi

1 Answer


This is a use case that Flocker is meant to solve. (Disclaimer: I'm the CTO at ClusterHQ.) See this blog post for a demo of the Flocker <=> Mesos/Marathon interaction, which shows how the Flocker Control Service can act as the "source of truth" for which container volumes exist in a clustered setting. Flocker creates these volumes on demand and then coordinates mapping and unmapping them between hosts as the containers that reference them move around the cluster.

Flocker does this by providing a cluster-wide namespace of volume names; these names can then be used via the Flocker plugin for Docker with Marathon to provide portability and high availability for stateful containers in a Mesos cluster.
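As a rough sketch only (not an official example from ClusterHQ or Mesosphere), the nginx app definition from the question could be pointed at the Flocker plugin for Docker just by switching the volume driver, assuming the plugin is installed on every slave and registered under the driver name "flocker":

{
    "id": "nginx",
    "container": {
        "docker": {
            "image": "million12/nginx",
            "network": "BRIDGE",
            "portMappings": [{
                "containerPort": 80,
                "hostPort": 0,
                "protocol": "tcp"
            }],
            "parameters": [{
                "key": "volume-driver",
                "value": "flocker"
            }, {
                "key": "volume",
                "value": "nginx-data:/data/www"
            }]
        }
    },
    "cpus": 0.2,
    "mem": 32.0,
    "instances": 1
}

Because the volume name lives in Flocker's cluster-wide namespace, the same definition keeps working if Marathon reschedules the task onto another host.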

Flocker also has a Ceph driver:

  • Google "Flocker Ceph Driver"

And works on CoreOS:

  • Google "Flocker on CoreOS demo"

You can run multi-instance jobs (like MongoDB with replica sets) by giving each container its own volume name (like mongo_1, mongo_2, etc).
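For illustration, and assuming the same "flocker" driver name as above, one way to approximate this today is to run each replica-set member as its own Marathon app with its own volume name (the app id, image, and mount path here are placeholders):

{
    "id": "mongo-rs-1",
    "container": {
        "docker": {
            "image": "mongo",
            "network": "BRIDGE",
            "parameters": [{
                "key": "volume-driver",
                "value": "flocker"
            }, {
                "key": "volume",
                "value": "mongo_1:/data/db"
            }]
        }
    },
    "cpus": 0.5,
    "mem": 512.0,
    "instances": 1
}

A second app "mongo-rs-2" would be identical except for its id and the volume name mongo_2, and so on; each member then keeps its own data volume wherever it gets rescheduled.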

Putting these pieces together would be non-trivial, but I'd be happy to help. I could write up a detailed guide specifically for your stack (Ceph + CoreOS + Docker + Mesos + Marathon) if you like.

Luke Marsden
  • Thanks for your fast and detailed answer! I saw the tutorial at https://clusterhq.com/2015/10/06/marathon-ha-demo/ some time ago, but as far as I understand this also covers only the one-instance use case. Is there any other article that covers scaling? – Tobi Jan 26 '16 at 13:02
  • Sorry, another question: Concerning distinct volume names, does this need to be set up manually? If so, I don't really understand how this can work when I start a new application via a JSON definition in Marathon. Ideally, the volume creation/migration/scaling process should be transparent to the user. Otherwise I don't see the benefit of using Marathon... But maybe I misunderstood this. – Tobi Jan 26 '16 at 14:00
  • You'd need some way to automatically number the volumes. We have an idea with Flocker to implement "MultiVolumes", these would be volumes where you could say `flockerctl scale mongodb=5` and Flocker would allocate 5 usable volumes for the MultiVolume `mongodb`, then each time you start a container with this volume name it would give you back one of these volumes which isn't being used by another container. Would that work for your use case do you think @Tobi? – Luke Marsden Jan 26 '16 at 14:52
  • Also, see https://github.com/mesosphere/marathon/issues/2493 Persistent volume support is now scheduled for Marathon 0.16 (was 0.14 before...). – Tobi Jan 26 '16 at 15:22
  • Could be a solution, though I'm not sure if this can really work without being tightly integrated into Mesos/Marathon. When I want to scale, I want to do this in 1 step. From my understanding of http://mesos.apache.org/documentation/latest/persistent-volume/ a persistent volume would need to be created per instance of the Marathon application, and Marathon would need to store the mapping between task id and pers. vol. id (I assume in ZooKeeper). If a task goes down and is restarted, Marathon would need to get the pv id from ZooKeeper and automatically attach the pv to the task's container... – Tobi Jan 26 '16 at 15:27
  • @Tobi, thanks for the replies! How would you envisage scaling down to work in this case? If you scale down from 10 containers -> 5, would you expect the 5 spare volumes to be destroyed? In my mind, this is why it's useful to distinguish between container scaling and volume scaling, which is what led us to suggest the MultiVolume idea... interested in your thoughts here! – Luke Marsden Jan 27 '16 at 17:41
  • That's an interesting question, and some colleagues and I were also wrapping our minds around that. I think it would actually depend on the type of stateful service, and the way it uses replication and failover mechanisms. E.g. MongoDB handles that differently than Elasticsearch. I'd really be interested in the "official" Mesosphere position on stateful services, but unfortunately there is nothing, which is a pity IMHO. I followed you on Twitter, maybe we can continue the discussion via PM. – Tobi Jan 28 '16 at 09:49