1

Background: I'm working on an ASP.NET (classic) WebAPI that makes use of a data-set retrieved from an external web service to service its requests.

  • The data-set is immutable
  • The data-set reasonably expensive to fetch from the external web service (could take 5-10 seconds).
  • Once the data-set has been fetched, accessing the data needs to be fast
  • The data-set needs to be periodically updated (replaced) in the background.
  • The periodic updates must not affect the performance of code that's reading the data-set
  • It's not a problem if different threads see versions of the immutable data-set however any given call to IDataSetProvider.GetLatestAsync must return the most recently fetched data-set
  • It's not a problem if calls to IDataSetProvider.GetLatestAsync take a while when the AppPool is started/restarted (i.e. data not fetched yet) however subsequent calls should be essentially instant.

Implementation Summary:

  • A DataSetProvider class that holds a reference to the "latest" data-set value & exposes an interface for consuming code to access the latest value. This is registered as a single-instance in the IoC container
  • A worker that starts a long-running task responsible for updating the "latest" data-set value on the DataSetProvider. This worker is again single-instance in the IoC container and is automatically activated (using Ninject here)

Questions:

  • I don't think I need to use a semaphore to protect access to the _latestDataSet field because updates to references in C# are atomic & it's only ever going to be updated by one thread at a time (could be different threads due to async continuations). Am I correct about this?
  • Do I need the volatile keyword on _latestDataSet to ensure the thread calling IDataSetProvider.GetLatestAsync gets the latest value (and doesn't, for example, get null forever due to JITing optimizations or whatever)

Implementation (unimportant details elided for brevity):

public interface IDataSetProvider
{
    Task<DataSet> GetLatestAsync();
}

public interface IDataSetUpdater
{
    void Update(DataSet latestDataSet);
}

public class DataSetProvider : IDataSetProvider, IDataSetUpdater
{
    private static readonly TimeSpan InitialFetchTimeout = TimeSpan.FromSeconds(20);
    private static readonly TimeSpan InitialDataPollingInterval = TimeSpan.FromMilliseconds(100);

    private volatile DataSet _latestDataSet;

    public async Task<DataSet> GetLatestAsync()
    {
        TimeSpan elapsed = TimeSpan.Zero;

        // Updated by DataSetProviderWorker task. Only null briefly on application pool startup.
        while (_latestDataSet == null)
        {
            if (elapsed >= InitialFetchTimeout)
            {
                throw new InvalidOperationException($"Data has not been populated & timeout ({InitialFetchTimeout}) reached.");
            }

            await Task.Delay(InitialDataPollingInterval);
            elapsed += InitialDataPollingInterval;
        }

        return _latestDataSet;
    }

    public void Update(DataSet latestDataSet)
    {
        _latestDataSet = latestDataSet;
    }
}

public class DataSetProviderWorker : IStartable
{
    private static readonly TimeSpan Period = TimeSpan.FromSeconds(20);

    private readonly IDataSetUpdater _dataSetUpdater;
    private readonly CancellationTokenSource _cancellationTokenSource;

    public void Start()
    {
        new TaskFactory().StartNew(RunAsync, _cancellationTokenSource.Token, TaskCreationOptions.LongRunning);
    }

    private async Task RunAsync(object obj)
    {
        while (!_cancellationTokenSource.IsCancellationRequested)
        {
            try
            {
                await DoWorkAsync().ConfigureAwait(false);
            }
            catch (System.Exception e)
            {
                _logger.LogError(e, $"{nameof(DataSetProviderWorker)} encountered an unhandled exception in the execution loop. Will retry in {Period}");
            }

            // Delay execution even if DoWorkAsync threw an exception. Could be a transient error, in which case we don't want to spam
            await Task.Delay(Period).ConfigureAwait(false);
        }
    }

    private async Task DoWorkAsync()
    {
        var immutableData = await _externalDataSource.FetchData();
        var latestDataSet = new DataSet(immutableData);
        _dataSetUpdater.Update(latestDataSet);
    }

    public void Stop()
    {
        _cancellationTokenSource.Cancel();
    }
}
Theodor Zoulias
  • 34,835
  • 7
  • 69
  • 104
eddiewould
  • 1,555
  • 16
  • 36
  • 1
    If this was a database, my advise would be a Distributed design. Using a in-process DB at the server as a cache. Of course, you can also easily do that kind of stuff with WebServices. Write your own one, whose whole purpose is to provide a cached copy of the originals data. That way you would not have to deal with many threads working with it. – Christopher May 09 '20 at 22:04
  • 1
    Are you talking about the `System.Data.DataSet` class, or for a custom class with the same name? Because the `System.Data.DataSet` class is certainly not immutable. – Theodor Zoulias May 09 '20 at 22:35
  • 1
    As a side note, you can get a more consistent interval between updates if you create the `Task.Delay(Period)` task before the `await DoWorkAsync()`, and await it afterwards. – Theodor Zoulias May 09 '20 at 22:39
  • See marked duplicate for the exact answer to when you would need `volatile`. Based on your incomplete code example, you may or may not need `volatile` and it may or may not be sufficient. There is a race condition, so part of the answer would depend on whether you care about the code duplicating effort. If you need more help, post a question with a good [mcve], explain what about the code isn't working the way you want, and what _specifically_ you need help with. Be sure to limit the post _to a single question_, otherwise it's too broad for Stack Overflow. – Peter Duniho May 09 '20 at 22:41
  • @TheodorZoulias Sorry, it's a class from another namespace. – eddiewould May 09 '20 at 23:51
  • @PeterDuniho for my benefit, would you mind spelling out the race condition / duplicated effort you mentioned? – eddiewould May 10 '20 at 00:02
  • 1
    The `volatile` keyword may not be strictly required in this case, provided that you are planning to install your application in a machine with a known CPU architecture, and you have studied the memory model of the said CPU architecture in accordance with the specification of the C# language. But I guess that no sane person would be happy to invest that much mental energy in order to achieve (hopefully) a slightly faster implementation with inherent portability limitations. So I would say that practically it's required. – Theodor Zoulias May 10 '20 at 00:29
  • @TheodorZoulias that counts as "strictly required" according to my definition. Is it _sufficent_ though? Or do I need to use a `SemaphoreSlim` or similar? – eddiewould May 10 '20 at 02:06
  • _"would you mind spelling out the race condition / duplicated effort you mentioned?"_ -- it depends on the rest of the code. If you are sure that `Update()` is only ever called in exactly one place, it's probably fine. But otherwise, the `_latestDataSet` could be set by multiple places, and callers to `GetLatestAsync()` could receive different values. Even a single caller to that method could receive a different value than the one that was the value that caused the loop to terminate. The code doesn't provide enough context for me to say for sure that the problem exists, only that it _could_. – Peter Duniho May 10 '20 at 04:27
  • 1
    Yes, it is sufficient. All threads will see the new immutable "dataset" practically instantly. No `SemaphoreSlim` or `lock` is required. If you are concerned about performance though, you should try to minimize the access to the volatile field. For example the `GetLatestAsync` is accessing the `volatile _latestDataSet` twice, so it probably pays the cost of volatility twice. If you are willing to trade data freshness for performance you could have each thread maintain a local reference of a "dataset", and update this reference (from the `volatile` shared field) every 10 iterations or so. – Theodor Zoulias May 10 '20 at 09:56
  • 1
    Also, since you use `await`, keep in mind that the memory barrier(s) injected by `volatile` may be redundant in your case, because `await` injects memory barrier(s) too. Comments by Stephen Cleary: *It is not documented, but Microsoft people have assured me that the barriers are always present if await causes a thread switch* [(citation)](https://stackoverflow.com/questions/59649998#comment105538283_59649998). *await injects all necessary thread barriers, so there's no issues around out-of-order reads or anything like that* [(citation)](https://stackoverflow.com/a/59503720/11178549). – Theodor Zoulias May 10 '20 at 10:07

0 Answers0