Why is Stopping/OnStop not called when scaling in in Azure?

Question

RoleEnvironment.Stopping/RoleEntryPoint.OnStop() are not called for dying instances when I reduce my role instance count. They are called when the instance is rebooted or the deployment is stopped. What am I doing wrong, or am I just not supposed to need to clean up in this case?

I have a simple worker role (VS2012 update 1, default Cloud project with one worker role, added smarx's Table storage trace listener). All code here; no other dependencies:

using System;
using System.Collections.Generic;
using System.Data.Services.Client;
using System.Diagnostics;
using System.Linq;
using System.Net;
using System.Threading;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.Diagnostics;
using Microsoft.WindowsAzure.ServiceRuntime;
using Microsoft.WindowsAzure.StorageClient;

namespace WorkerRole1
{
    public class WorkerRole : RoleEntryPoint
    {
        bool shouldRun = true;
        EventWaitHandle runFinished = new EventWaitHandle(true, EventResetMode.ManualReset);

        public override bool OnStart()
        {
            ServicePointManager.DefaultConnectionLimit = 12;
            RoleEnvironment.Stopping += (object sender, RoleEnvironmentStoppingEventArgs e) => {
                Trace.WriteLine("WorkerRole1 Stopping called", "Information");
                shouldRun = false;
            };
            return base.OnStart();
        }

        public override void Run()
        {
            runFinished.Reset();
            try {
                Trace.WriteLine("WorkerRole1 entry point called", "Information");
                while (shouldRun) {
                    Thread.Sleep(10000);
                    Trace.WriteLine("Working", "Information");
                }
                Trace.WriteLine("Finished", "Information");
            } finally {
                runFinished.Set();
            }
        }

        public override void OnStop()
        {
            Trace.WriteLine("OnStop: Waiting for Run() to finish", "Information");
            runFinished.WaitOne();
            Trace.WriteLine("OnStop: Run() finished", "Information");
            base.OnStop();
        }
    }

    public class LogMessage : TableServiceEntity
    {
        public DateTime Time { get; set; }
        public string Message { get; set; }
        public string InstanceId { get; set; }
        public string Category { get; set; }

        public LogMessage() { }
        public LogMessage(string message, string category)
        {
            Message = message;
            Category = category;
            Time = DateTime.UtcNow;
            InstanceId = RoleEnvironment.CurrentRoleInstance.Id;
            PartitionKey = RoleEnvironment.DeploymentId;
            RowKey = (DateTime.MaxValue.Ticks - Time.Ticks).ToString("d19");
        }
    }

    public class TableTraceListener : TraceListener
    {
        private TableServiceContext _context = null;
        private TableServiceContext context
        {
            get
            {
                if (_context == null) {
                    var tables = CloudStorageAccount
                        .Parse(RoleEnvironment.GetConfigurationSettingValue(
                            Attributes["connectionStringName"] ?? "DataConnectionString"))
                        .CreateCloudTableClient();
                    tables.CreateTableIfNotExist("log");
                    _context = tables.GetDataServiceContext();
                    _context.MergeOption = MergeOption.NoTracking;
                }
                return _context;
            }
        }

        protected override string[] GetSupportedAttributes() { return new[] { "connectionStringName" }; }

        public override void Write(string message, string category)
        {
            context.AddObject("log", new LogMessage(message, category));
            context.SaveChangesWithRetries();
        }

        public override void WriteLine(string message, string category) { Write(message + "\n", category); }
        public override void Write(string message) { Write(message, null); }
        public override void WriteLine(string message) { Write(message + "\n"); }
    }
}

Well, they should be called. Is it possible that they are called, something goes wrong and you don't see any output? — sharptooth, May 06 '13 at 06:02
@sharptooth, I don't see how, but how can I check? (All the code is shown in my question, and the program doesn't know the mechanism being used to stop it. During scale in I can see log messages from the main loop, and then they stop appear. In the other cases I see the various OnStop/Stopping log messages, and the main loop shutdown.) — Oliver Bock, May 07 '13 at 00:04
Two reasons possible. First, it's possible that the event and the method are indeed not called due to some bug present in your version of Azure SDK (note that you don't specify the version number). Second, it's possible that Azure Diagnostics doesn't have enough time to transfer the logs and the VM is terminated before it does that. Anyway `Stopping`/`OnStop` should be called when scaling down. — sharptooth, May 07 '13 at 10:37
@sharptooth, I am using SDK v1.8, but v1.7 for storage. I am putting logging in Table storage (see code) so Azure Diagnostics is not involved. — Oliver Bock, May 07 '13 at 21:41
I have submitted this to Microsoft as a Support request. I will post their response when it comes. — Oliver Bock, May 08 '13 at 04:02
Microsoft are still investigating. I have submitted a different example that uses SQL Azure to do the logging. — Oliver Bock, Jul 15 '13 at 05:57
Microsoft have archived my ticket, probably because I got a bit fed up with them when they called me just before bed to tell me nothing had happened. In a way it's a relief that they aren't hassling me. Although I wonder whether anything will come of this. The test team verified my report long ago; apparently it is now with some deeper part of Microsoft. — Oliver Bock, Sep 12 '13 at 03:17

score 2 · Answer 1 · answered Jul 16 '13 at 14:51

From my experimenting, it appears that the fabric controller removes the role IPs from the dynamic whitelist of the DB server during role scale down operations (before the role is fully removed).

If this is also the cause of your problem, then a possible workaround could be to manually add the IP range 0.0.0.0-255.255.255.255 to the DB server's whitelist (at the expense of some security). Or you could re-architect your application to write data/messages to queues instead of the DB during OnStop (for a worker role to copy to DB later).

Why is Stopping/OnStop not called when scaling in in Azure?

1 Answers1

Linked