What is the canonical way to do blue/green deployment with the Spring Cloud/Netflix stack on PWS?

Question

I'm experimenting with a setup that is very much like the one detailed in the image here: https://raw.githubusercontent.com/Oreste-Luci/netflix-oss-example/master/netflix-oss-example.png

In my setup, I'm using a client application (https://www.joedog.org/siege-home/), a proxy (Zuul), a discovery service (Eureka) and a simple microservice. Everything is deployed on PWS.

I want to migrate from one version of my simple microservice to the next without any downtime. Initially I started out with the technique described here: https://docs.cloudfoundry.org/devguide/deploy-apps/blue-green.html

In my opinion, this approach is not "compatible" with a discovery service such as Eureka. In fact, the new version of my service is registered in Eureka and receives traffic even before I can remap all the routes (CF Router).

This led me to another approach, in which I rely on the failover mechanisms in Spring Cloud/Netflix:

1. I spin up a new (backwards compatible) version of my service.
2. When this version is picked up by Zuul/Eureka, it starts getting 50% of the traffic.
3. Once I have verified that the new version works correctly, I take down the "old" instance. (I just click the "stop" button in PWS.)

As I understand it, Zuul uses Ribbon (load balancing) under the hood, so in that split second where the old instance is still in Eureka but actually shutting down, I expect a retry on the new instance without any impact on the client.

However, my assumption is wrong. I get a few 502 errors in my client:

Lifting the server siege...      done.

Transactions:               5305 hits
Availability:              99.96 %
Elapsed time:              59.61 secs
Data transferred:          26.06 MB
Response time:              0.17 secs
Transaction rate:          89.00 trans/sec
Throughput:             0.44 MB/sec
Concurrency:               14.96
Successful transactions:        5305
Failed transactions:               2
Longest transaction:            3.17
Shortest transaction:           0.14

Part of my application.yml

server:
  port: ${PORT:8765}

info:
  component: proxy

ribbon:
  MaxAutoRetries: 2   # Max number of retries on the same server (excluding the first try)
  MaxAutoRetriesNextServer: 2 # Max number of next servers to retry (excluding the first server)
  OkToRetryOnAllOperations: true # Whether all operations can be retried for this client
  ServerListRefreshInterval: 2000 # Interval to refresh the server list from the source
  ConnectTimeout: 3000 # Connect timeout used by Apache HttpClient
  ReadTimeout: 3000 # Read timeout used by Apache HttpClient

hystrix:
  threadpool:
    default:
      coreSize: 50
      maxQueueSize: 100
      queueSizeRejectionThreshold: 50
  command:
    default:
      execution:
        isolation:
          thread:
            timeoutInMilliseconds: 10000

I'm not sure what goes wrong.

Is this a technical issue?

Or am I making the wrong assumptions (I did read somewhere that POSTs are not retried anyway, which I don't really understand)?

I'd love to hear how you do it.

Thanks, Andy

score 2 · Answer 1 · answered May 05 '16 at 21:53

I've wondered about this also. I won't claim to have used Spring Cloud "In Anger". I've just been experimenting with it for a while.

Assumption: if the source of truth for all instance state is stored in Eureka, then Eureka should be our mechanism of operational control. We can use Eureka to take an instance out of service by setting the instance state to OUT_OF_SERVICE. When Ribbon refreshes its server list, it will no longer use these out-of-service instances. Eureka provides a REST API for querying instances and setting instance state. Great.
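As a rough sketch of that REST call: the Eureka REST operations include a status override of the form PUT .../apps/{appName}/{instanceId}/status?value=OUT_OF_SERVICE. The base URL, app name and instance id below are placeholders, and I'm assuming a Spring Cloud Eureka server that serves the API under /eureka and a Java 11+ client. Instance ids with unusual characters would need URL encoding, which this sketch skips.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class EurekaStatusOverride {

    // Builds the status-override URI. baseUrl is assumed to look like "http://host:8761/eureka".
    static URI statusUri(String baseUrl, String appName, String instanceId, String status) {
        return URI.create(baseUrl + "/apps/" + appName + "/" + instanceId + "/status?value=" + status);
    }

    // Builds (but does not send) the PUT request that marks one instance OUT_OF_SERVICE.
    static HttpRequest outOfServiceRequest(String baseUrl, String appName, String instanceId) {
        return HttpRequest.newBuilder(statusUri(baseUrl, appName, instanceId, "OUT_OF_SERVICE"))
                .PUT(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            System.out.println("usage: EurekaStatusOverride <baseUrl> <appName> <instanceId>");
            return;
        }
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(outOfServiceRequest(args[0], args[1], args[2]), HttpResponse.BodyHandlers.ofString());
        System.out.println("Eureka answered with HTTP " + response.statusCode());
    }
}
```

This handles one instance at a time, so a deployment script would still have to work out which instance ids to call it for.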

The problem is: How do I identify which instances are in the Blue group and which instances are in the Green group?

I was thinking... Eureka provides a metadata map for each instance. Say in our build / bake step we set a version id in the metadata map? We could use a Git commit Id or some semantic versioning scheme or whatever. Ok, now I can look at the Eureka metadata and identify Blue versus Green instances given that version value. We can set the metadata values in each service using properties.

e.g. eureka.instance.metadataMap.version=8675309

Now what would be nice is if we could just tell Eureka: "Take all the instances for the FOOBAR service and version 8675309 out of service." Well, I don't think that's provided out of the box. The cool thing about Spring Cloud is that all these services, including Eureka Server, are just Spring apps that we can hack for our own needs. The code below exposes an endpoint that sets instances to "out of service" given an app name and a version. Just add this controller to your Eureka Server. It's not production ready, just an idea really.

Now once Eureka takes these instances out of service and Ribbon refreshes its server list it is safe to kill or route away from these instances.
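How long to wait before it is "safe" depends on a few refresh intervals stacking up. As a back-of-the-envelope sketch, assuming the commonly cited default of 30 seconds each for the Eureka server's response cache and the Eureka client's registry fetch (check these against your own versions and settings), plus the 2000 ms ServerListRefreshInterval from the question's config, the worst case adds up like this:

```java
import java.time.Duration;

public class PropagationDelay {

    // Worst case: each layer has just refreshed when the status changes, so the delays add up.
    static Duration worstCase(Duration serverCacheRefresh,
                              Duration clientRegistryFetch,
                              Duration ribbonListRefresh) {
        return serverCacheRefresh.plus(clientRegistryFetch).plus(ribbonListRefresh);
    }

    public static void main(String[] args) {
        Duration wait = worstCase(
                Duration.ofSeconds(30),   // assumed Eureka server response cache refresh
                Duration.ofSeconds(30),   // assumed Eureka client registry fetch interval
                Duration.ofMillis(2000)); // ServerListRefreshInterval from the question
        System.out.println("Wait at least " + wait.getSeconds() + " s before stopping the old instances");
    }
}
```

With these assumed values that is about a minute, so clicking "stop" right after the status change would still be too early.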

POST to:

http://[eurekahost:port]/takeInstancesOutOfService?applicationName=FOOBAR&version=8675309
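A deployment script could make that call from Java 11+ roughly like this. It is only a sketch: the class name is made up, the host is a placeholder, and it assumes the controller from this answer has been added to the Eureka Server.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class TakeOutOfServiceClient {

    // Builds the URL of the custom endpoint, URL-encoding both query parameters.
    static URI endpointUri(String eurekaBase, String applicationName, String version) {
        String query = "applicationName=" + URLEncoder.encode(applicationName, StandardCharsets.UTF_8)
                + "&version=" + URLEncoder.encode(version, StandardCharsets.UTF_8);
        return URI.create(eurekaBase + "/takeInstancesOutOfService?" + query);
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            System.out.println("usage: TakeOutOfServiceClient <eurekaBase> <applicationName> <version>");
            return;
        }
        HttpRequest request = HttpRequest.newBuilder(endpointUri(args[0], args[1], args[2]))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // The controller returns one summary line per instance it touched.
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```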

Hope that helps?

import java.util.Collection;
import java.util.function.Predicate;
import java.util.stream.Collectors;

import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

import com.netflix.appinfo.InstanceInfo;
import com.netflix.appinfo.InstanceInfo.InstanceStatus;
import com.netflix.discovery.shared.Application;
import com.netflix.eureka.EurekaServerContextHolder;
import com.netflix.eureka.registry.PeerAwareInstanceRegistry;

@RestController
public class EurekaInstanceStateController {

    @RequestMapping(value="/instancesQuery", method=RequestMethod.POST)
    public Collection<String> queryInstancesByMetaData(
            @RequestParam("applicationName") String applicationNameCriteria,
            @RequestParam("version") String versionCriteria)
    {
        return getRegistry().getSortedApplications()
                .stream()
                .filter(hasApplication(applicationNameCriteria))
                .flatMap(app -> app.getInstances().stream())
                .filter(hasVersion(versionCriteria))
                .map(info -> info.getAppName() + " - " + info.getId() + " - " + info.getStatus() + " - " + info.getMetadata().get("version"))
                .collect(Collectors.toList());
    }

    @RequestMapping(value="/takeInstancesOutOfService", method=RequestMethod.POST)
    public Collection<String> takeInstancesOutOfService(
            @RequestParam("applicationName") String applicationNameCriteria,
            @RequestParam("version") String versionCriteria)
    {
        return getRegistry().getSortedApplications()
                .stream()
                .filter(hasApplication(applicationNameCriteria))
                .flatMap(app -> app.getInstances().stream())
                .filter(hasVersion(versionCriteria))
                .map(instance -> updateInstanceStatus(instance, InstanceStatus.OUT_OF_SERVICE) )
                .collect(Collectors.toList());
    }

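    /**
     * Sets the status of a single instance in the Eureka registry.
     *
     * @param instance the instance to update
     * @param status   the status to set, e.g. OUT_OF_SERVICE
     * @return a summary line: app name, instance id and whether the update succeeded
     */
    private String updateInstanceStatus(InstanceInfo instance, InstanceStatus status)
    {
        boolean isSuccess = getRegistry().statusUpdate(
                instance.getAppName(),
                instance.getId(),
                status,
                String.valueOf(System.currentTimeMillis()), // lastDirtyTimestamp
                true);                                      // isReplication flag

        return instance.getAppName() + " - " + instance.getId() + " result: " + isSuccess;
    }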

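    /**
     * Application name predicate. Eureka stores application names in upper case,
     * so the criteria is upper-cased before comparing.
     *
     * @param applicationNameCriteria the application name to match (case-insensitive)
     * @return a predicate that is true for the application with that name
     */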
    private Predicate<Application> hasApplication(final String applicationNameCriteria)
    {
        return application -> applicationNameCriteria.toUpperCase().equals(application.getName());
    }

    /**
     * Instance version predicate. Uses the Eureka instance metadata entry named "version".<br>
     *
     * Set / bake the instance metadata map to contain a version value.<br>
     * e.g. eureka.instance.metadataMap.version=85839c2
     *
     * @param versionCriteria the version value to match
     * @return a predicate that is true for instances whose "version" metadata equals versionCriteria
     */
    private Predicate<InstanceInfo> hasVersion(final String versionCriteria)
    {
        return info -> versionCriteria.equals(info.getMetadata().get("version"));
    }

    private PeerAwareInstanceRegistry getRegistry() {
        return EurekaServerContextHolder.getInstance().getServerContext().getRegistry();
    }
}

Good idea. I'm looking into this too. But I'm not sure about doing this on the Eureka side - if the service sends a new heartbeat, won't it change its state back to UP again? Spring Cloud comes with /pause and /resume endpoints, which I think change the client state to OUT_OF_SERVICE or DOWN. I was thinking about a deployment script that submits to /pause before deployment. The list of instances to submit this to could still be pulled from Eureka and filtered by version or something. — nedenom, May 06 '16 at 22:11
I was looking into the OUT_OF_SERVICE state too. From what I understand, it looks like Asgard takes a similar approach: https://github.com/Netflix/asgard/wiki/Eureka-Integration My conclusion is that, in order to implement rolling updates on PWS, we need a custom, homebrewed dashboard (such as Asgard) that will facilitate this. The PWS view is too limited to do this. AFAIK there is no Spring library that does this. I hadn't realised I could develop my own REST endpoints for this as you did, so I started out with the REST API of Eureka itself. I will have a look at that - thanks! — Andy Verbunt, May 09 '16 at 07:10
@nedenom if you set the status to DOWN it will automatically be set to UP again after 30 seconds. If you set the status to OUT_OF_SERVICE, it will stay that way until you manually (via REST api) set it back UP/DOWN. — Andy Verbunt, May 09 '16 at 07:16
Netflix has deprecated Asgard and is now using Spinnaker (www.spinnaker.io), which looks promising, with support for several cloud providers. I just had a quick browse of the site, and at least on the cloud provider setup page they had some instructions for PWS. — nedenom, May 10 '16 at 05:27
