
Very simply, I am storing a time series as a DateTime-keyed dictionary with a signature of IDictionary<DateTime, double?>. This holds a month of 10-minute resolution data, so up to 4,464 entries.

In order to process sections of this time series, we need to extract the entries between a start and an end DateTime.

A naïve way of doing this is to obtain a subset of the Dictionary keys for the range we're interested in:

var reducedKeys = timeSeries.Keys.Where(k => k >= start && k <= end).ToList();

Then extract the relevant section from the larger timeSeries:

var reducedTimeSeries = timeSeries.Where(kvp => reducedKeys.Contains(kvp.Key)).ToDictionary(w => w.Key, w => w.Value);

This doesn't feel like an optimal solution. Any suggestions for a faster extraction strategy?

For clarity, the ordering of the timestamps is largely irrelevant at this stage, as the higher-level calculations happen across multiple time series rather than within the same series. There is a flatline filter to run after the extraction, but it is fine to run that by iterating over a sorted copy of the keys from the extract, since we would typically have a 12-24 sample series after extraction from the longer source series.

Ross Halliday
    "better" and "optimal" are both broad words, encompassing many different factors. Please tell us what metric(s) you're most interested in. And "all of them" is not acceptable :). Please also read [Eric Lippert's blog post on performance](https://ericlippert.com/2012/12/17/performance-rant/). – Heretic Monkey Nov 24 '20 at 13:34
  • Thanks @HereticMonkey that's a good read :-) In this case it's definitely speed that we're looking at as the operation is within a large loop operation that's not amenable to set based processing. – Ross Halliday Nov 24 '20 at 13:59
  • If you don't mind adding a third-party dependency to your app, you could consider using the [C5](https://github.com/sestoft/C5) library. It has a `TreeDictionary` collection that contains the method `RangeFromTo`, that returns efficiently a range of keys inside the collection. – Theodor Zoulias Nov 24 '20 at 15:07
  • If the offset is always from the same date, then you can just use an array with the start date, and each subsequent element is 10 minutes of offset – Stanislav Nov 24 '20 at 18:49

5 Answers

3

Dictionary<TKey, TValue> stores its entries in buckets based on the hash of the TKey values. Simply put, your time series entries are not stored in order. This makes working with DateTime ranges very inefficient, because you need to enumerate all items to find the appropriate ones.

You could consider a SortedDictionary<TKey, TValue> instead: it keeps its entries ordered by the TKey values (internally a binary search tree), so keys enumerate in ascending order and a range scan does not have to touch the whole collection. See this answer: https://stackoverflow.com/a/9053294/1323798.

I'd personally look for data structures that are specifically designed for use with time series data if you care about performance. But as always, it all depends on what you intend to do with your time series.
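As a minimal sketch of the idea (sample data and names invented for illustration): because a SortedDictionary enumerates its keys in ascending order, a range extraction can skip past the head of the series and stop as soon as it passes the end, instead of scanning every entry.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Build a sorted series: keys are kept in ascending order automatically.
var series = new SortedDictionary<DateTime, double?>();
var t0 = new DateTime(2020, 11, 1);
for (int i = 0; i < 144; i++)                 // one day at 10-minute resolution
    series[t0.AddMinutes(10 * i)] = i;

var start = t0.AddHours(2);
var end = t0.AddHours(4);

// Keys enumerate in order, so we can skip the head of the series and
// stop as soon as we pass 'end' instead of testing every entry.
var extract = series
    .SkipWhile(kvp => kvp.Key < start)
    .TakeWhile(kvp => kvp.Key <= end)
    .ToDictionary(kvp => kvp.Key, kvp => kvp.Value);

Console.WriteLine(extract.Count); // 13 samples: 02:00 to 04:00 inclusive
```

Note this still walks the head of the series linearly; the win over a plain Dictionary is that it stops early and never visits entries after the end of the range.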

mstaessen
  • I've always been a bit wary of using SortedDictionary, but in this case it may be a good option as we build the datasets once on extraction from the DB then use the data extensively afterwards. It's worth a trial :-) – Ross Halliday Nov 24 '20 at 13:57
2

I don't know if it's the most optimal solution, but it's definitely shorter and easier to understand:

var reducedTimeSeries = timeSeries.Where(x => x.Key >= start && x.Key <= end).ToDictionary(x => x.Key, x => x.Value);
dontbyteme
0

If SortedDictionary doesn't suit you, you can create your own custom solution.

For example, if the offset is constant you can store the data in an array and index by offset.

In my case, the custom solution was about 3 times faster:

//(1000 iterations)
// DateTimeArray:00:00:00.1629569
// SortedDictionary:00:00:00.5831970

public sealed class DateTimeArray<TValue>
{
    private readonly DateTime _startDate;
    private readonly int _step;
    private readonly int _size;
    private TValue[] _values;
    public DateTimeArray(DateTime startDate, int step = 10, int size = 4464)
    {
        if (step <= 0)
            throw new ArgumentOutOfRangeException(nameof(step), "step cannot be less than 1");
        _startDate = startDate;
        _step = step;
        _size = size;
        _values = new TValue[size];
    }

    public DateTimeArray(DateTime startDate, DateTime endDate, int step = 10)
        : this(startDate, step, (int) ((endDate - startDate).TotalMinutes / step))
    {
    }

    public void Add(DateTime date, TValue value)
    {
        var offset = (int)((date - _startDate).TotalMinutes);
        var current = offset / _step;
        if (_size <= current)
            throw new IndexOutOfRangeException($"{current}>={_size}");
        _values[current] = value;
    }

    public TValue[] Between(DateTime from, DateTime? to = null)
    {
        var offsetFrom = (int)((from - _startDate).TotalMinutes) / _step;
        var offsetTo = _size;
        if (to.HasValue)
        {
            offsetTo = (int)((to.Value - _startDate).TotalMinutes) / _step;
        }

        if (offsetFrom >= offsetTo)
            throw new IndexOutOfRangeException($"{offsetFrom}>={offsetTo}");
        return _values.Skip(offsetFrom).Take(offsetTo - offsetFrom).ToArray();
    }
}
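The core trick here, index arithmetic instead of key scanning, can also be sketched without the wrapper class (dates and values invented for illustration):

```csharp
using System;
using System.Linq;

var start = new DateTime(2020, 11, 1);
const int stepMinutes = 10;
var values = new double?[4464];          // one month of 10-minute slots

// Store: the slot index is just the offset from the series start, in steps.
for (int i = 0; i < 12; i++)
    values[(int)(start.AddMinutes(stepMinutes * i) - start).TotalMinutes / stepMinutes] = i * 0.5;

// Extract 00:00 (inclusive) to 01:00 (exclusive) purely by index arithmetic.
int from = 0;
int to = (int)(start.AddHours(1) - start).TotalMinutes / stepMinutes;  // 6
double?[] firstHour = values.Skip(from).Take(to - from).ToArray();

Console.WriteLine(firstHour.Length); // 6 slots: 00:00 to 00:50
```

Computing the two indices is O(1); only the copy of the extracted slice costs anything, which is exactly what the Between method above does.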
Stanislav
0

Dictionary<DateTime, double?> implements the interface IEnumerable<KeyValuePair<DateTime, double?>>. So if you want all values in the dictionary whose keys lie between two DateTimes, use this interface to select the key-value pairs that you want to put in your new dictionary.

DateTime startTime = ...
DateTime endTime = ...
Dictionary<DateTime, double?> dictionary = ...

var itemsToPutInNewDictionary = dictionary
    .Where(keyValuePair => startTime <= keyValuePair.Key
                                     && endTime >= keyValuePair.Key);
// note: the query is not executed yet!
var limitedDictionary = new Dictionary<DateTime, double?>(itemsToPutInNewDictionary);

Or you can use ToDictionary at the end of the LINQ query:

.ToDictionary(keyValuePair => keyValuePair.Key, keyValuePair => keyValuePair.Value);

You'll have to look at how a dictionary is built, but I think the latter solution is slower.
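To make the deferred execution visible (sample data invented for illustration), note that the Where filter only runs when the new dictionary's constructor enumerates the query:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var dictionary = new Dictionary<DateTime, double?>();
var t0 = new DateTime(2020, 11, 1);
for (int i = 0; i < 6; i++)              // samples at 00:00, 00:10, ..., 00:50
    dictionary[t0.AddMinutes(10 * i)] = i;

DateTime startTime = t0, endTime = t0.AddMinutes(30);

// Deferred: nothing is filtered yet, this only describes the query.
var items = dictionary.Where(kvp => startTime <= kvp.Key && endTime >= kvp.Key);

// Adding to the source *after* building the query still affects the result,
// because the filter actually runs here, inside the constructor's enumeration.
dictionary[t0.AddMinutes(25)] = 99;
var limited = new Dictionary<DateTime, double?>(items);

Console.WriteLine(limited.Count); // 5: 00:00, 00:10, 00:20, 00:25, 00:30
```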

Harald Coppoolse
-1

You could copy the contents of the Dictionary into a sorted array, and then use the Array.BinarySearch method to search into the array for dates with O(log n) efficiency.

Below is an extension method RangeFromTo for sorted copies of dictionaries. It returns a range of the contents that fall between two keys:

public static IEnumerable<KeyValuePair<TKey, TValue>> RangeFromTo<TKey, TValue>(
    this KeyValuePair<TKey, TValue>[] sortedArray,
    TKey from, TKey to)
{
    var keyComparer = Comparer<TKey>.Default;
    var pairComparer = Comparer<KeyValuePair<TKey, TValue>>.Create(
        (x, y) => keyComparer.Compare(x.Key, y.Key));

    int fromIndex = Array.BinarySearch(sortedArray,
        new KeyValuePair<TKey, TValue>(from, default), pairComparer);
    int toIndex = Array.BinarySearch(sortedArray,
        new KeyValuePair<TKey, TValue>(to, default), pairComparer);

    if (fromIndex < 0) fromIndex = ~fromIndex;
    if (toIndex < 0) toIndex = ~toIndex - 1; // last element <= to, or -1 if none

    for (int i = fromIndex; i <= toIndex; i++)
    {
        yield return sortedArray[i];
    }
}

Usage example:

var dictionary = new Dictionary<DateTime, double>();
/* Code that fills the dictionary omitted */

var sortedArray = dictionary.OrderBy(e => e.Key).ToArray();

var selection = sortedArray.RangeFromTo(
    new DateTime(2020, 1, 1), new DateTime(2020, 12, 31));

foreach (var pair in selection)
{
    DateTime date = pair.Key;
    double value = pair.Value;
    Console.WriteLine($"{date:yyyy/MM/dd} => {value}");
}

There is one gotcha: you must recreate the sorted array every time the dictionary is updated. These two must always be in sync.

Warning: you should make enough searches per copy to justify its cost. Ideally a single copy should serve a great many searches. In the extreme case where each copy serves only a single search, performance will be significantly worse than a simple linear search.

Theodor Zoulias
  • Ordering the data is O(n*log(n)). Doing the linear search is O(n). You've just made it perform significantly *worse*, not better. – Servy Nov 24 '20 at 21:50
  • @Servy yeap, if you have to sort the dictionary every time you search, it will surely be worse. Hopefully the OP has a dictionary that is not updated very often, but needs to be searched frequently. – Theodor Zoulias Nov 24 '20 at 22:23