
Input:

    public class MyObject
    {
        public double Value { get; set; }
        public DateTime Date { get; set; }
    }

Method to generate test objects:

public static MyObject[] GetTestObjects()
{
    var rnd = new Random();
    var date = new DateTime(2021, 1, 1, 0, 0, 0);
    var result = new List<MyObject>();
    for (int i = 0; i < 50000; i++)
    {
        //this is to simulate real data having gaps
        if (rnd.Next(100) < 25)
        {
            continue;
        }
        var myObject = new MyObject()
        {
            Value = rnd.NextDouble(),
            Date = date.AddMinutes(15 * i)
        };
        result.Add(myObject);
    }
    return result.ToArray();
}

Given this, I need to calculate the maximum Value over the previous 12 months for each MyObject. I could just do this in parallel, but maybe there is a more optimized solution?

Sorry for being unclear, this is what I use right now to get what I want:

        public MyObject[] BruteForceBackward(MyObject[] testData)
        {
            return testData.AsParallel().Select(point =>
            {
                var max = testData.Where(x => x.Date <= point.Date && x.Date >= point.Date.AddYears(-1)).Max(x => x.Value);
                return new MyObject() { Date = point.Date, Value = point.Value / max };
            }).OrderBy(r => r.Date).ToArray();
        }

This works, but it is slow and eats processor resources (imagine you have 100k objects). I believe there must be something better.

Nick Farsi
  • Perhaps you need [this](https://stackoverflow.com/questions/12190184/can-min-max-of-moving-window-achieve-in-on/12195098#12195098) – MBo Dec 29 '21 at 10:32
  • @nick farsi, does my answer solve your question? I'd be happy for the bounty :D – julian bechtold Jan 06 '22 at 22:40

4 Answers


I had a similar project where I had to calculate this kind of statistic on tons of sensor data.

You can now find a slightly more refined version in my GitHub repository, which should be ready to use (.NET): https://github.com/forReason/Statistics-Helper-Library

In general you want to reduce the number of loops going over all your data. Ideally, you want to touch each element only once.

Process Array (equivalent of BruteForceBackward)

public static MyObject[] FlowThroughForward(ref MyObject[] testData)
{
    // generate return array
    MyObject[] returnData = new MyObject[testData.Length];
    // keep track to minimize processing
    double currentMaximum = 0;
    List<MyObject> maximumValues = new List<MyObject>();
    // go through the elements
    for (int i = 0; i < testData.Length; i++)
    {
        // calculate the oldest date to keep in tracking list
        DateTime targetDate = testData[i].Date.AddYears(-1);
        // maximum logic
        if (testData[i].Value >= currentMaximum)
        {
            // new maximum found, clear tracking list
            // this is the best case scenario
            maximumValues.Clear();
            currentMaximum = testData[i].Value;
        }
        else
        {
            // unfortunately, no new maximum was found
            // go backwards through the maximum tracking list and check for smaller values
            // clear the list of all smaller values. The list should therefore always
            // be in descending order
            for (int b = maximumValues.Count - 1; b >= 0; b--)
            {
                if (maximumValues[b].Value <= testData[i].Value)
                {
                    // a lower value has been found. We have a newer, higher value
                    // clear this waste value from the tracking list
                    maximumValues.RemoveAt(b);
                }
                else
                {
                    // there are no more lower values. 
                    // stop looking for smaller values to save time
                    break;
                }
            }
        }
        // append new value to tracking list, no matter if higher or lower
        // all future values might be lower
        maximumValues.Add(testData[i]);
        // check if the oldest value is too old to be kept in the tracking list
        while (maximumValues[0].Date < targetDate)
        {
            // oldest value is to be removed
            maximumValues.RemoveAt(0);
            // update maximum
            currentMaximum = maximumValues[0].Value;
        }
        // add object to result list
        returnData[i] = new MyObject() { Date = testData[i].Date, Value = testData[i].Value / currentMaximum };
    }
    return returnData;
}

Real Time Data or Streamed Data

Note: If you have really large lists, you might run into memory issues with the approach of passing a full array. In that case, pass one value at a time, from oldest to newest, and store the results back one at a time. This function can also be used on real-time data. A test method is included in the code below.

static void Main(string[] args)
{
    int length = 50000;
    
    Stopwatch stopWatch1 = new Stopwatch();
    stopWatch1.Start();
    var result = new List<MyObject>();
    var date = new DateTime(2021, 1, 1, 0, 0, 0);
    for (int i = 0; i < length; i++)
    {
        //this is to simulate real data having gaps
        if (rnd.Next(100) < 25)
        {
            continue;
        }
        // create a fresh object each iteration: CalculateNextObject keeps a
        // reference in its tracking list, so reusing one mutated instance would corrupt it
        var myObject = new MyObject()
        {
            Value = rnd.NextDouble(),
            Date = date.AddMinutes(15 * i)
        };
        result.Add(CalculateNextObject(ref myObject));
    }
    stopWatch1.Stop();
    Console.WriteLine("test code executed in " + stopWatch1.ElapsedMilliseconds + " ms");
    Thread.Sleep(1000000);
}

private static Random rnd = new Random();
private static double currentMaximum = 0;
private static List<MyObject> maximumValues = new List<MyObject>();
public static MyObject CalculateNextObject(ref MyObject input)
{
    // calculate the oldest date to keep in the tracking list
    DateTime targetDate = input.Date.AddYears(-1);
    // maximum logic
    if (input.Value >= currentMaximum)
    {
        // new maximum found, clear tracking list
        // this is the best case scenario
        maximumValues.Clear();
        currentMaximum = input.Value;
    }
    else
    {
        // unfortunately, no new maximum was found
        // go backwards through the maximum tracking list and check for smaller values
        // clear the list of all smaller values. The list should therefore always
        // be in descending order
        for (int b = maximumValues.Count - 1; b >= 0; b--)
        {
            if (maximumValues[b].Value <= input.Value)
            {
                // a lower value has been found. We have a newer, higher value
                // clear this waste value from the tracking list
                maximumValues.RemoveAt(b);
            }
            else
            {
                // there are no more lower values.
                // stop looking for smaller values to save time
                break;
            }
        }
    }
    // append new value to tracking list, no matter if higher or lower
    // all future values might be lower
    maximumValues.Add(input);
    // check if the oldest value is too old to be kept in the tracking list
    while (maximumValues[0].Date < targetDate)
    {
        // oldest value is to be removed
        maximumValues.RemoveAt(0);
        // update maximum
        currentMaximum = maximumValues[0].Value;
    }
    // create and return the result object
    MyObject returnData = new MyObject() { Date = input.Date, Value = input.Value / currentMaximum };
    return returnData;
}

Test Method

static void Main(string[] args)
{
    MyObject[] testData = GetTestObjects();
    Stopwatch stopWatch1 = new Stopwatch();
    Stopwatch stopWatch2 = new Stopwatch();
    stopWatch1.Start();
    MyObject[] testresults1 = BruteForceBackward(testData);
    stopWatch1.Stop();
    Console.WriteLine("BruteForceBackward executed in " + stopWatch1.ElapsedMilliseconds + " ms");
    stopWatch2.Start();
    MyObject[] testresults2 = FlowThroughForward(ref testData);
    stopWatch2.Stop();
    Console.WriteLine("FlowThroughForward executed in " + stopWatch2.ElapsedMilliseconds + " ms");
    Console.WriteLine();
    Console.WriteLine("Comparing some random test results: ");
    var rnd = new Random();
    for (int i = 0; i < 10; i++)
    {
        int index = rnd.Next(0, testData.Length);
        Console.WriteLine("Index: " + index + " brute: " + testresults1[index].Value + " flow: " + testresults2[index].Value);
    }
    Thread.Sleep(1000000);
}

Test result

Tests were performed on a machine with 32 cores, so in theory the multithreaded approach should be at an advantage, but you'll see ;)

Function             Time      Time %
BruteForceBackward   5334 ms   99.9%
FlowThroughForward   5 ms      0.094%

Performance improvement factor: roughly 1000×.

Console output with data validation:

BruteForceBackward executed in 5264 ms
FlowThroughForward executed in 5 ms

Comparing some random test results:
Index: 25291 brute: 0.989688139105413 flow: 0.989688139105413
Index: 11945 brute: 0.59670821976193 flow: 0.59670821976193
Index: 30282 brute: 0.413238225210297 flow: 0.413238225210297
Index: 33898 brute: 0.38258761939139 flow: 0.38258761939139
Index: 8824 brute: 0.833512217105447 flow: 0.833512217105447
Index: 22092 brute: 0.648052464067263 flow: 0.648052464067263
Index: 24633 brute: 0.35859417692481 flow: 0.35859417692481
Index: 24061 brute: 0.540642018793402 flow: 0.540642018793402
Index: 34219 brute: 0.498785766613022 flow: 0.498785766613022
Index: 2396 brute: 0.151471808392111 flow: 0.151471808392111

CPU usage was a lot higher on BruteForceBackward due to the parallelisation.

The worst-case scenario is long periods of decreasing values. The code can still be vastly optimized, but I guess this should be sufficient. For further optimisation, one might look at reducing the list shuffling when removing/adding elements to maximumValues.
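To illustrate that last point, here is a minimal sketch of a circular-buffer-backed tracking list (my own illustration, not part of the answer's tested code; the `MonotonicRing` name and its fixed capacity are assumptions, and the capacity must be at least the largest number of elements a one-year window can hold):

    public class MonotonicRing
    {
        private readonly MyObject[] _buffer;
        private int _head;  // index of the front element, which is always the current maximum
        private int _count;

        public MonotonicRing(int capacity) => _buffer = new MyObject[capacity];

        public double Max => _buffer[_head].Value;

        public void Add(MyObject item)
        {
            // drop dominated entries from the back, same logic as maximumValues above
            while (_count > 0 && _buffer[(_head + _count - 1) % _buffer.Length].Value <= item.Value)
                _count--;
            _buffer[(_head + _count) % _buffer.Length] = item;
            _count++;
            // drop expired entries from the front by advancing the head index,
            // instead of shuffling the array with RemoveAt(0)
            while (_buffer[_head].Date < item.Date.AddYears(-1))
            {
                _head = (_head + 1) % _buffer.Length;
                _count--;
            }
        }
    }

With 15-minute samples, a one-year window holds at most 366 × 96 = 35,136 points, so that is a safe capacity bound for this data set.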

julian bechtold
  • For benchmarking, benchmark.net is easy to set up and readable. It doesn't pollute the code with stopwatches. – Drag and Drop Jan 05 '22 at 08:45
  • Thank you for the suggestion. I won't install third-party tools just to quickly prove a point. Normally I use the built-in performance profiler of Visual Studio. – julian bechtold Jan 05 '22 at 08:48
  • Good job, my benchmarks show your algorithm outperforms one using a priority queue. – Funk Jan 06 '22 at 21:01
  • @Funk Thank you :) I don't know how a priority queue would be applied well to this problem. It could still be vastly improved with a "circular list array" with moving start/end indexes. This would negate all the shuffling and adding/removing on "maximumValues" entirely and reduce the list management to constant-time index operations, bringing the calculation cost almost down to zero. – julian bechtold Jan 06 '22 at 22:45
  • I briefly looked up what a priority queue is. If I understand it correctly, you sort the incoming array by value, then check the highest element until its datetime is out of scope. Is that correct? – julian bechtold Jan 06 '22 at 23:02
  • @Julian That's right, @MBo describes the workflow in his comment under OP. You can also clear the priority queue when a new maximum is found, to give it a fighting chance against your algorithm, but no cigar. With .NET 6 you can now find a priority queue implementation in the `System.Collections.Generic` namespace. – Funk Jan 07 '22 at 06:26
  • The default list, which is backed by an array, performs very well in this use case. As there's no double-ended queue in the .NET BCL, I've refactored your code to use the native doubly linked list, but that didn't perform nearly as well. To squeeze more performance out of it, you'd probably need to whip up something custom. – Funk Jan 07 '22 at 06:27
  • One thing to take into account: if there are gaps (deviating time intervals) in the input data, as is the case with OP, the last `if` should be replaced by a `while` to ensure all obsolete values are removed. – Funk Jan 07 '22 at 06:55
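For reference, a minimal sketch of the priority-queue approach discussed in the comments above (an illustration, not code from the answer; it assumes .NET 6's `PriorityQueue<TElement, TPriority>`). The queue is a min-heap, so priorities are negated; out-of-window entries are evicted lazily, which is safe because whenever the top of the heap is inside the window it is necessarily the window maximum:

    public static MyObject[] PriorityQueueForward(MyObject[] testData)
    {
        var result = new MyObject[testData.Length];
        var pq = new PriorityQueue<MyObject, double>();
        for (int i = 0; i < testData.Length; i++)
        {
            var point = testData[i];
            // optional optimization from the comments: a new overall maximum
            // dominates everything already queued
            if (pq.Count > 0 && point.Value >= pq.Peek().Value)
                pq.Clear();
            pq.Enqueue(point, -point.Value); // negate to turn the min-heap into a max-heap
            // lazily evict entries that are both on top and out of the one-year window
            while (pq.Peek().Date < point.Date.AddYears(-1))
                pq.Dequeue();
            result[i] = new MyObject { Date = point.Date, Value = point.Value / pq.Peek().Value };
        }
        return result;
    }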

An interesting and challenging problem. I put together a solution using a dynamic programming approach (first learned in a CS algorithms class back in '78). First, a tree is constructed containing pre-calculated local max values over recursively defined ranges. Once constructed, the max value for an arbitrary range can be efficiently calculated mostly from the pre-calculated values. Only at the fringes of the range does the calculation drop down to the element level.

It is not as fast as julian bechtold's FlowThroughForward method, but random access to ranges may be a plus.

Code to add to Main:

    Console.WriteLine();
    Stopwatch stopWatch3 = new Stopwatch();
    stopWatch3.Start();
    MyObject[] testresults3 = RangeTreeCalculation(ref testData, 10);
    stopWatch3.Stop();
    Console.WriteLine($"RangeTreeCalculation executed in {stopWatch3.ElapsedMilliseconds} ms");

    ... test comparison
    Console.WriteLine($"Index: {index} brute: {testresults1[index].Value} flow: {testresults2[index].Value} rangeTree: {testresults3[index].Value}");

Test function:

public static MyObject[] RangeTreeCalculation(ref MyObject[] testDataArray, int partitionThreshold)
{
    // For this implementation, we need to convert the array to a List, because we need a
    // reference type object that can be shared.
    List<MyObject> testDataList = testDataArray.ToList();

    // Construct a tree containing recursive collections of pre-calculated values
    var rangeTree = new RangeTree(testDataList, partitionThreshold);

    MyObject[] result = new MyObject[testDataList.Count];
    Parallel.ForEach(testDataList, (item, state, i) =>
        {
            var max = rangeTree.MaxForDateRange(item.Date.AddYears(-1), item.Date);
            result[i] = new MyObject() { Date = item.Date, Value = item.Value / max };
        });

    return result;
}

Supporting class:

// Class used to divide and conquer using dynamic programming.
public class RangeTree
{
    public List<MyObject> Data; // This reference is shared by all members of the tree
    public int Start { get; } // Index of first element covered by this node.
    public int Count { get; } // Number of elements covered by this node.
    public DateTime FirstDateTime { get; }
    public DateTime LastDateTime { get; }
    public double MaxValue { get; }  // Pre-calculated max for all elements covered by this node.
    List<RangeTree> ChildRanges { get; }

    // Top level node constructor
    public RangeTree(List<MyObject> data, int partitionThreshold)
        : this(data, 0, data.Count, partitionThreshold)
    {
    }
    
    // Child node constructor, which covers a recursively decreasing range of elements.
    public RangeTree(List<MyObject> data, int start, int count, int partitionThreshold)
    {
        Data = data;
        Start = start;
        Count = count;
        FirstDateTime = Data[Start].Date;
        LastDateTime = Data[Start + Count - 1].Date;
        if (count <= partitionThreshold)
        {
            // If the range is smaller than the threshold, just calculate the local max
            // directly from the items. No child ranges are defined.
            MaxValue = Enumerable.Range(Start, Count).Select(i => Data[i].Value).Max();
        }
        else
        {
            // We still have a significant range. Decide how to further divide them up into sub-ranges.
            // (There may be room for improvement here to better balance the tree.)
            int partitionSize = (count - 1) / partitionThreshold + 1;
            int partitionCount = (count - 1) / partitionSize + 1;
            if (count < partitionThreshold * partitionThreshold)
            {
                // When one level above the leaf nodes, prefer fewer, fuller leaf nodes
                // over a larger number of sparsely populated ones.
                partitionCount = (count - 1) / partitionThreshold + 1;
                partitionSize = (count - 1) / partitionCount + 1;
            }

            ChildRanges = Enumerable.Range(0, partitionCount)
                .Select(partitionNum => new {
                        ChildStart = Start + partitionNum * partitionSize,
                        ChildCount = Math.Min(partitionSize, Count - partitionNum * partitionSize)
                    })
                .Where(part => part.ChildCount > 0) // Defensive
                .Select(part => new RangeTree(Data, part.ChildStart, part.ChildCount, partitionThreshold))
                .ToList();

            // Now is the dynamic programming part:
            // Calculate the local max as the max of all child max values.
            MaxValue = ChildRanges.Max(child => child.MaxValue);
        }
    }

    // Get the max value for a given range of dates within this RangeTree node.
    // This uses the pre-calculated values as much as possible.
    // Only at the fringes of the date range do we calculate at the element level.
    public double MaxForDateRange(DateTime fromDate, DateTime thruDate)
    {
        double calculatedMax = Double.MinValue;
        if (fromDate > this.LastDateTime || thruDate < this.FirstDateTime)
        {
            // Entire range is excluded. Nothing of interest here folks.
            calculatedMax = Double.MinValue;
        }
        else if (fromDate <= this.FirstDateTime && thruDate >= this.LastDateTime)
        {
            // Entire range is included. Use the already-calculated max.
            calculatedMax = this.MaxValue;
        }
        else if (ChildRanges != null)
        {
            // We have child ranges. Recurse and accumulate.
            // Possible optimization: Calculate max for middle ranges first, and only bother
            // with extreme partial ranges if their local max values exceed the preliminary result.
            for (int i = 0; i < ChildRanges.Count; ++i)
            {
                double childMax = ChildRanges[i].MaxForDateRange(fromDate, thruDate);
                if (childMax > calculatedMax)
                {
                    calculatedMax = childMax;
                }
            }
        }
        else
        {
            // Leaf range. Loop through just this limited range of elements, checking individually for
            // date in range and accumulating the result.
            for (int i = 0; i < this.Count; ++i)
            {
                var element = Data[this.Start + i];
                if (fromDate <= element.Date && element.Date <= thruDate && element.Value > calculatedMax)
                {
                    calculatedMax = element.Value;
                }
            }
        }

        return calculatedMax;
    }
}

There's plenty of room for improvement, such as parameterizing the types and generalizing the functionality to support more than just Max(Value), but the framework is there.
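As a sketch of that parameterization (illustrative only; `AggregateTree`, `selector`, and `combine` are names introduced here, not part of the answer), the node's hard-coded max can be replaced by a caller-supplied selector and combiner:

    // Sketch: a node that stores a caller-defined aggregate instead of a hard-coded max.
    public class AggregateTree<TAgg>
    {
        private readonly List<AggregateTree<TAgg>> _children;
        public TAgg Aggregate { get; }

        public AggregateTree(List<MyObject> data, int start, int count, int threshold,
                             Func<MyObject, TAgg> selector, Func<TAgg, TAgg, TAgg> combine)
        {
            if (count <= threshold)
            {
                // leaf: fold the covered elements directly
                Aggregate = Enumerable.Range(start, count)
                    .Select(i => selector(data[i]))
                    .Aggregate(combine);
            }
            else
            {
                // inner node: partition as before and fold the child aggregates
                int size = (count - 1) / threshold + 1;
                _children = Enumerable.Range(0, (count - 1) / size + 1)
                    .Select(p => new AggregateTree<TAgg>(
                        data, start + p * size, Math.Min(size, count - p * size),
                        threshold, selector, combine))
                    .ToList();
                Aggregate = _children.Select(c => c.Aggregate).Aggregate(combine);
            }
        }
    }

The same tree then serves different queries: pass `o => o.Value` with `Math.Max` for a max tree, or with `(a, b) => a + b` for range sums; a `MaxForDateRange`-style query generalizes the same way by folding partial results with `combine`.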

T N

Assuming you meant you need the maximum Value for each of the last 12 months from result, you can use LINQ:

var beginDateTime = DateTime.Now.AddMonths(-12);
var ans = result
    .Where(r => r.Date >= beginDateTime)
    .GroupBy(r => r.Date.Month)
    .Select(mg => mg.MaxBy(r => r.Value))
    .ToList();

Running some timings, I get that putting AsParallel() after result changes the run time from around 16 ms (first run) to around 32 ms, so it is actually slower. It is about the same after the Where and about 23 ms after the GroupBy (processing the 12 groups in parallel). On my PC at least, there isn't enough data, and the operations aren't complex enough, for parallelism to pay off; the GroupBy isn't the most efficient approach either.
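For clarity, the fully parallel variant being timed looks roughly like this (a sketch; the placement of `AsParallel()` is what was moved between measurements, and `MaxBy` requires .NET 6):

    var ans = result.AsParallel()
        .Where(r => r.Date >= beginDateTime)
        .GroupBy(r => r.Date.Month)
        .Select(mg => mg.MaxBy(r => r.Value))
        .ToList();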

Using an array and testing each element, I get the results in about 1.2ms:

var maxMOs = new MyObject[12];
foreach (var r in result.Where(r => r.Date >= beginDateTime)) {
    var monthIndex = r.Date.Month-1;
    if (maxMOs[monthIndex] == null || r.Value > maxMOs[monthIndex].Value)
        maxMOs[monthIndex] = r;
}

Note that the results are not chronological; you could offset monthIndex by today's month to order the results if desired.

var maxMOs = new MyObject[12];
var thisMonth = DateTime.Now.Month;
foreach (var r in result.Where(r => r.Date >= beginDateTime)) {
    // map months so the oldest month lands at index 0 and the current month at index 11
    var monthIndex = (r.Date.Month - thisMonth + 11) % 12;
    if (maxMOs[monthIndex] == null || r.Value > maxMOs[monthIndex].Value)
        maxMOs[monthIndex] = r;
}

A micro-optimization (mostly useful on repeated runs) is to invert the test and use the null-conditional operator:

if (!(r.Value <= maxMOs[monthIndex]?.Value))

This saves about 0.2ms on the first run but up to 0.5ms on subsequent runs.

NetMage

Here is a solution similar to julian bechtold's answer. The difference is that the maximum (and all related variables) is kept hidden away from the main implementation, in a separate class whose sole purpose is to keep track of the maximum over the past year. The algorithm is the same; I just use a few LINQ expressions here and there.

We keep track of the maximum in the following class:

        public class MaxSlidingWindow
        {
            private readonly List<MyObject> _maximumValues;
            private double _max;

            public MaxSlidingWindow()
            {
                _maximumValues = new List<MyObject>();
                _max = double.NegativeInfinity;
            }

            public double Max => _max;
            
            public void Add(MyObject myObject)
            {
                if (myObject.Value >= _max)
                {
                    _maximumValues.Clear();
                    _max = myObject.Value;
                }
                else
                {
                    RemoveValuesSmallerThan(myObject.Value);
                }

                _maximumValues.Add(myObject);
                RemoveObservationsBefore(myObject.Date.AddYears(-1));

                _max = _maximumValues[0].Value;
            }

            private void RemoveObservationsBefore(DateTime targetDate)
            {
                var toRemoveFromFront = 0;
                while (toRemoveFromFront < _maximumValues.Count && _maximumValues[toRemoveFromFront].Date < targetDate)
                {
                    toRemoveFromFront++;
                }

                _maximumValues.RemoveRange(0, toRemoveFromFront);
            }

            private void RemoveValuesSmallerThan(double targetValue)
            {
                var maxEntry = _maximumValues.Count - 1;
                var toRemoveFromBack = 0;
                while (toRemoveFromBack <= maxEntry && _maximumValues[maxEntry - toRemoveFromBack].Value <= targetValue)
                {
                    toRemoveFromBack++;
                }

                _maximumValues.RemoveRange(maxEntry - toRemoveFromBack + 1, toRemoveFromBack);
            }
        }

It can be used as follows:

        public static MyObject[] GetTestObjects_MaxSlidingWindow()
        {
            var rnd = new Random();
            var date = new DateTime(2021, 1, 1, 0, 0, 0);
            var result = new List<MyObject>();
            var maxSlidingWindow = new MaxSlidingWindow();
            for (int i = 0; i < 50000; i++)
            {
                //this is to simulate real data having gaps
                if (rnd.Next(100) < 25)
                {
                    continue;
                }
                var myObject = new MyObject()
                {
                    Value = rnd.NextDouble(),
                    Date = date.AddMinutes(15 * i)
                };
                
                maxSlidingWindow.Add(myObject);
                var max = maxSlidingWindow.Max;
                result.Add(new MyObject { Date = myObject.Date, Value = myObject.Value / max });
            }
            return result.ToArray();
        }

See the relative timings below; the above solution is slightly faster (timed over 10 million runs), but the difference is barely noticeable:

Relative timings

Roger
  • Interesting to note for others and for future reference - the algorithm that [julian bechtold](https://stackoverflow.com/users/8695110/julian-bechtold) used is described in detail [here](https://people.cs.uct.ac.za/~ksmith/articles/sliding_window_minimum.html) and [here](https://richardhartersworld.com/slidingmin/), and is usually referred to as the Ascending Minima (Maxima) algorithm. See also [this](https://dercuano.github.io/notes/sliding-rmq-achievements.html) page describing the problem. – Roger Jan 08 '22 at 08:38
  • An older Stack Overflow question dealing with the complexity of the problem can be found [here](https://stackoverflow.com/questions/53094476/why-is-the-deque-solution-to-the-sliding-window-maximum-problem-on-instead-o). Obviously the textbook problem has a fixed window, whereas the posted question may have gaps in the data - but that is a minor adjustment and does not change the complexity. – Roger Jan 08 '22 at 09:06