37

My program receives approximately 50,000 numbers every second.

At any given moment, I need to calculate the minimum, maximum and average of the values (numbers) that arrived in the last second (relative to that moment).

Is there a way to do this without using array or list (buffer) to store arriving numbers and to calculate results?

If I need to use buffer, what would be the efficient way to achieve this?

(Note that numbers from buffer also must be efficiently removed from time to time)

John Y
Dusan

11 Answers

21

Here is an algorithm that keeps things efficient in most cases:

  1. As events come in, buffer them completely, and calculate a running sum, count, min, max (trivial).

  2. When a request for average, min, or max is made, loop through from the back of the buffer and start removing values older than one second. Subtract from sum and count as you go.

    • If the removed values are all above the current min, you can keep your min. If they are all below the current max, you can keep your max. In this scenario you have average, min, and max updated efficiently.

    • If a removed value is at or below the current min, or at or above the current max, you'll need to loop through the rest of the buffer and recalculate the invalidated extreme.

  3. Also run step two roughly once a second so the buffer doesn't grow too large. This could equally be done on every buffer insert, or wherever it makes sense. (A sketch of steps 1-3 follows this list.)
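
A rough sketch of steps 1-3 in C#. The names (Sample, RollingStats) are illustrative, a Queue<T> stands in for the circular buffer discussed below, and empty-window edge cases are glossed over:

using System;
using System.Collections.Generic;

struct Sample { public double Value; public DateTime Time; }

class RollingStats
{
    private readonly Queue<Sample> buffer = new Queue<Sample>();
    private double sum, min = double.MaxValue, max = double.MinValue;

    // Step 1: buffer the event and keep running aggregates.
    public void Add(double value, DateTime now)
    {
        buffer.Enqueue(new Sample { Value = value, Time = now });
        sum += value;
        if (value < min) min = value;
        if (value > max) max = value;
    }

    // Step 2: drop samples older than one second; rescan only if an expiring
    // value invalidated the current min or max.
    public (double Min, double Max, double Avg) Query(DateTime now)
    {
        bool rescan = false;
        while (buffer.Count > 0 && (now - buffer.Peek().Time).TotalSeconds > 1.0)
        {
            Sample old = buffer.Dequeue();
            sum -= old.Value;
            if (old.Value <= min || old.Value >= max) rescan = true;
        }
        if (rescan)
        {
            min = double.MaxValue; max = double.MinValue;
            foreach (Sample s in buffer)
            {
                if (s.Value < min) min = s.Value;
                if (s.Value > max) max = s.Value;
            }
        }
        return (min, max, buffer.Count > 0 ? sum / buffer.Count : 0.0);
    }
}

Step 3 is then just a matter of calling Query (or only its eviction loop) on a timer, or on every insert.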

The best structure for this kind of work is a circular buffer, to avoid memory allocations and the GC getting in the way. It should be large enough to cover the worst-case number of messages per second.

Updates

Depending on the usage scenario, another option is to run the algorithm above on 10 × 100 ms chunks rather than a single 1 × 1000 ms piece; that is, keep the running min, max, sum and count for each of those 10 chunks. Then, when you reach an 'invalidation' scenario, you generally only need to look through the latest 100 ms of data plus the stored min and max of the other 9 chunks.


@ja72 provided a great idea to save on finding the min and max values if they are invalidated:

Instead of keeping the min/max values x_min, x_max keep instead the index of where they are located in the x[i] array with i_min and i_max. Then finding them can be trivial sometimes, but when the last value considered holds the min and max, the entire list needs to be scanned to establish the new limits.


Sam Holder had another good idea in the comments - keep a parallel array that is always sorted; this lets you lop numbers off the top or bottom to find new minimums and maximums more easily. However, insert speed is compromised a little (the array needs to remain in order).


Ultimately, the right choice will depend on the usage characteristics of the program. How often will values be read vs how often they are inserted?

yamen
  • By using the interlocked queues in my suggestion, it's possible to never have to look for the min and max (the queue maintains them, and no work has to be done except when elements are added or removed), but you're probably right that in most scenarios the cost of allocating the nodes and maintaining a sorted list is more than having to look through the list again for the new max/min. – Theodore Murdock Apr 23 '12 at 21:38
  • It will depend on how often a query is made. If not made often, inserts need to be cheap, lookups potentially expensive. And vice versa. – yamen Apr 23 '12 at 21:41
  • If I use circular buffer (one time allocated array in combination with lower and upper bound), does that mean that inserts and deletes(shrinks) are very cheap? – Dusan Apr 23 '12 at 21:57
  • @Dusan: Inserts will be O(1). Deletes can be O(n) if you pass a max or min. Lookups may also be expensive. – btilly Apr 23 '12 at 22:11
  • If I am correct, finding index of the most recent element which is older than one second can be done using binary search? – Dusan Apr 23 '12 at 22:22
  • @Dusan binary search works only on an array of SORTED elements. – Daniel Mošmondor Apr 24 '12 at 05:14
  • They are sorted on time, so technically it is possible. Note that @ja72 below has a great trick for min/max that I've incorporated into my answer so people don't miss it. – yamen Apr 24 '12 at 05:50
6

Use a circular buffer where each element holds a timestamp and a value, sized to the maximum number of elements expected per second.

As each element is inserted at the buffer head, check the other side of the buffer for expired elements and delete them.

If a deleted element is the current minimum or maximum, you'll have to recalculate the min/max. If it isn't, just update the min/max from the new arrivals.

For the avg, keep the total, keep the count, and divide.
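
A sketch of that insert path, assuming a fixed-size array-backed circular buffer (the names are mine, and overflow of the fixed capacity is not handled):

using System;

class OneSecondWindow
{
    private readonly double[] valBuf;
    private readonly long[] timeBuf;              // timestamps in ticks
    private int head, tail, count;
    private double total, min, max;

    public OneSecondWindow(int capacity)
    {
        valBuf = new double[capacity];
        timeBuf = new long[capacity];
    }

    public void Insert(double value, long nowTicks)
    {
        // Check the other side of the buffer for expired elements and delete them.
        while (count > 0 && nowTicks - timeBuf[tail] > TimeSpan.TicksPerSecond)
        {
            double expired = valBuf[tail];
            tail = (tail + 1) % valBuf.Length;
            count--;
            total -= expired;
            // The deleted element was the minimum or maximum: recalculate.
            if (count > 0 && (expired == min || expired == max)) RecalcMinMax();
        }

        // Insert the new element at the buffer head and update min/max from the arrival.
        valBuf[head] = value;
        timeBuf[head] = nowTicks;
        head = (head + 1) % valBuf.Length;
        count++;
        total += value;
        if (count == 1 || value < min) min = value;
        if (count == 1 || value > max) max = value;
    }

    public double Min => min;
    public double Max => max;
    public double Average => count > 0 ? total / count : 0.0;   // keep the total, keep the count, divide

    private void RecalcMinMax()
    {
        min = double.MaxValue; max = double.MinValue;
        for (int i = 0, j = tail; i < count; i++, j = (j + 1) % valBuf.Length)
        {
            if (valBuf[j] < min) min = valBuf[j];
            if (valBuf[j] > max) max = valBuf[j];
        }
    }
}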

Daniel Mošmondor
3

Can't you keep a queue with your numbers and their arrival times, along with the current maximum and minimum values in the queue (you'll probably need to keep a count of how many values share the current min/max), the total value of all numbers in the queue, and a count of elements?

Then when a number arrives, add it to the queue and adjust the min/max/total and count. Then look at the other end of the queue and remove all elements that are not within 1 second of the arrival of the last number, again adjusting the max/min/count/total.

Then you don't need to calculate anything on demand; just return the precalculated values (i.e. read the current min/max or total/count).

As @yamen pointed out, you can't retain just the min and max, because when one gets removed you might not know the new one. In that case I would probably keep a second copy of all the numbers, ordered by value rather than by arrival time. Then you add and remove each number from this list as well, so you always know the max and minimum values. This saves you having to scan all elements in the buffer to find a new max/min, at the expense of keeping two copies, but updates to this list should be cheap as it is already ordered.
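
A sketch of the two-orderings idea. A Queue<T> holds arrival order; as a stand-in for the value-ordered copy I use a SortedSet with a sequence number to keep duplicate values distinct (that detail is my choice, not part of the answer):

using System;
using System.Collections.Generic;

class TwoOrderWindow
{
    private long seq;   // makes duplicate values distinct in the sorted copy
    private readonly Queue<(double Value, long Seq, DateTime Time)> byArrival =
        new Queue<(double, long, DateTime)>();
    private readonly SortedSet<(double Value, long Seq)> byValue =
        new SortedSet<(double, long)>();
    private double total;

    public void Add(double value, DateTime now)
    {
        long id = seq++;
        byArrival.Enqueue((value, id, now));
        byValue.Add((value, id));
        total += value;

        // Remove everything that arrived more than one second before the latest value.
        while ((now - byArrival.Peek().Time).TotalSeconds > 1.0)
        {
            var old = byArrival.Dequeue();
            byValue.Remove((old.Value, old.Seq));
            total -= old.Value;
        }
    }

    // Reads are just precalculated values or cheap lookups on the ordered copy.
    public double Min => byValue.Min.Value;
    public double Max => byValue.Max.Value;
    public double Average => byArrival.Count > 0 ? total / byArrival.Count : 0.0;
}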

Sam Holder
  • Count and total value are fine to adjust on the fly. Min and max can't be adjusted by removing values, since it takes a full scan of all values to find a new one when an old one is invalidated. My answer covers this worst case scenario. – yamen Apr 23 '12 at 21:01
  • Keeping a sorted list is overkill. Instead keep some sort of priority queue, such as a heap. Those take a logarithmic amount of work to get the max from, insert into, and delete from. However you'll need one for the max and another for the min, so you need 3 lists. – btilly Apr 23 '12 at 22:15
  • @btilly Good suggestion to use heaps instead of a sorted list. Reading the max/min is O(1), it's removing the max/min element that's O(log n), because work is needed to maintain the heap. As pointed out in my answer for lists, you'd need to use a single node object per element (despite using two heaps and a list), so you can remove the elements from the heaps when you discover they've expired, otherwise deletes would be O(n) as you'd have to search each heap for expired elements. Using a single node means we know where the element is in each heap as we eliminate expired elements from the list. – Theodore Murdock Apr 27 '12 at 00:12
2

@DanRedux is correct; you will need to calculate them each time because your input is changing. Now, you may prefer to calculate these numbers on demand or up front (i.e., when you get a new batch) depending on how often the results are needed.

For example, if your average use case polls for these stats every ~30 seconds then I would probably just calculate them on demand and cache the result until a new batch comes in. It really comes down to your usage scenario though.

As for how to store them, you don't really have a choice, do you? You need space for all 50,000 numbers in memory, so you need a chunk of memory large enough to hold them. To avoid constantly allocating a fresh buffer every time a new sequence comes in, you're probably better off declaring an array large enough to hold the largest possible data set and just reusing it. Again, this comes down to your requirements: do you know what your largest possible data set will be? Does allocating a new chunk of memory every second cause problems in your application over time?
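
A minimal sketch of the preallocate-and-reuse idea (the capacity is an assumed worst case):

class BatchStats
{
    private const int MaxBatchSize = 100_000;        // assumed upper bound on values per second
    private readonly double[] batch = new double[MaxBatchSize];
    private int count;

    // No per-batch allocation, so no GC churn.
    public void OnValue(double value)
    {
        if (count < batch.Length) batch[count++] = value;
    }

    // Compute on demand (or when the batch completes), then reuse the same array.
    public (double Min, double Max, double Avg) Compute()
    {
        double min = double.MaxValue, max = double.MinValue, sum = 0;
        for (int i = 0; i < count; i++)
        {
            if (batch[i] < min) min = batch[i];
            if (batch[i] > max) max = batch[i];
            sum += batch[i];
        }
        var result = (min, max, count > 0 ? sum / count : 0.0);
        count = 0;
        return result;
    }
}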

Ed S.
2

If the average of the last N values x[0] .. x[N-1] is m_1 (x[0] is the newest value, and x[N-1] the oldest value still considered), then the average m_2 after pushing everything back by one index and adding a new value x_new is

 m_2 = m_1 + (x_new - x[N-1]) / N;
 for (i = N-1; i > 0; i--) { x[i] = x[i-1]; }
 x[0] = x_new;

Instead of keeping the min/max values x_min, x_max keep instead the index of where they are located in the x[i] array with i_min and i_max. Then finding them can be trivial sometimes, but when the last value considered holds the min and max, the entire list needs to be scanned to establish the new limits.
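
A C# rendering of that update (iMin/iMax correspond to i_min/i_max above; the explicit shift is kept for clarity, though a circular index would avoid it):

static class FixedWindow
{
    // x[0] is the newest value, x[N-1] the oldest; m is the running average over all N values.
    public static void Push(double xNew, double[] x, ref double m, ref int iMin, ref int iMax)
    {
        int N = x.Length;
        m += (xNew - x[N - 1]) / N;                        // m_2 = m_1 + (x_new - x[N-1]) / N

        for (int i = N - 1; i > 0; i--) x[i] = x[i - 1];   // push everything back by one index
        x[0] = xNew;

        // Track the indices of the extremes; they slide along with the data.
        iMin++; iMax++;
        if (iMin >= N) iMin = IndexOfMin(x);               // the old minimum just fell off the end
        else if (xNew <= x[iMin]) iMin = 0;
        if (iMax >= N) iMax = IndexOfMax(x);               // the old maximum just fell off the end
        else if (xNew >= x[iMax]) iMax = 0;
    }

    static int IndexOfMin(double[] x) { int k = 0; for (int i = 1; i < x.Length; i++) if (x[i] < x[k]) k = i; return k; }
    static int IndexOfMax(double[] x) { int k = 0; for (int i = 1; i < x.Length; i++) if (x[i] > x[k]) k = i; return k; }
}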

John Alexiou
2

There's an efficient way to keep track of the minimum (or maximum) value within a given time window without usually having to store all the numbers that have arrived within that window. (However, the worst case scenario still does require storing all the numbers, so you need to reserve space for all of them or accept that you may sometimes get incorrect results.)

The trick is to only store values that:

  1. have arrived within the time window, and
  2. are smaller (or larger) than any later value.

A suitable data structure for implementing this is a simple circular buffer storing values and their arrival times. You'll need to maintain two indices into the buffer. Here's a simple English description of the algorithm:

At startup:

  • Allocate an N-element buffer val of values and a corresponding N-element buffer time of timestamps.
  • Let imax = 0 (or any other value between 0 and N−1 inclusive) and let inext = imax. This indicates that the buffer is currently empty.

When a new value new is received at time t:

  • While imax ≠ inext and time[imax] is outside the interval, increment imax by one (modulo N).
  • While imax ≠ inext and val[inext−1] ≥ new, decrement inext by one (modulo N).
  • Let val[inext] = new and time[inext] = t.
  • If inext ≠ imax − 1 (modulo N), increment inext by one (modulo N); else deal with the "buffer full" condition appropriately (e.g. allocate a larger buffer, throw an exception, or just ignore it and accept that the last value was not correctly recorded).

When the minimum value is requested:

  • While imax ≠ inext and time[imax] is outside the interval, increment imax by one (modulo N).
  • If imax ≠ inext, return val[imax]; else return an error indicating that no values have been received within the time interval.

If the values received are independent and identically distributed (and arrive as a Poisson process), I believe it can be shown that the average number of values stored in the list at any given time is ln(n+1), where n is the average number of values received within the time interval. For n = 50,000, ln(n+1) ≈ 10.82. However, one should keep in mind that this is only the average, and that several times more space may occasionally be required.
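
A sketch of the same idea in C#. For clarity it uses a LinkedList as a double-ended queue instead of the allocation-free circular buffer described above, and the identifiers are mine:

using System;
using System.Collections.Generic;

// Sliding-window minimum: only values that arrived within the window and are
// smaller than every later value are kept. The front of the deque is the minimum.
class WindowMinimum
{
    private readonly LinkedList<(double Val, DateTime Time)> deque =
        new LinkedList<(double Val, DateTime Time)>();
    private readonly TimeSpan window = TimeSpan.FromSeconds(1);

    public void Add(double value, DateTime now)
    {
        // Evict entries that have aged out of the window.
        while (deque.Count > 0 && now - deque.First.Value.Time > window)
            deque.RemoveFirst();

        // Evict entries from the back that are >= the new value:
        // they can never be the minimum again while the new value is in the window.
        while (deque.Count > 0 && deque.Last.Value.Val >= value)
            deque.RemoveLast();

        deque.AddLast((value, now));
    }

    public double Min(DateTime now)
    {
        while (deque.Count > 0 && now - deque.First.Value.Time > window)
            deque.RemoveFirst();
        if (deque.Count == 0) throw new InvalidOperationException("no values in the last second");
        return deque.First.Value.Val;
    }
}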


For the average, the same trick unfortunately doesn't work. If possible, you could switch to an exponential moving average, which can be easily tracked using very little space (just one number for the average and one timestamp indicating when it was last updated).
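
For instance, a time-decayed exponential moving average needs only the last average and a timestamp (the time constant tau is a free choice; nothing here is prescribed beyond the general idea above):

using System;

class ExponentialAverage
{
    private double average;
    private DateTime lastUpdate;
    private bool hasValue;
    private readonly double tauSeconds;   // smoothing time constant, e.g. 1.0 for roughly one second of memory

    public ExponentialAverage(double tauSeconds) { this.tauSeconds = tauSeconds; }

    public void Add(double value, DateTime now)
    {
        if (!hasValue) { average = value; hasValue = true; }
        else
        {
            // Weight the old average by how long ago it was last updated.
            double alpha = 1.0 - Math.Exp(-(now - lastUpdate).TotalSeconds / tauSeconds);
            average += alpha * (value - average);
        }
        lastUpdate = now;
    }

    public double Value => average;
}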

If that's not possible, but you're willing to accept a small amount of smoothing in the average values, you could calculate an average over, say, each millisecond. That way, whenever an average of the values over the last second is requested, you can just take an average of the last 1001 millisecond averages, weighting the oldest and newest of them according to how much of those milliseconds lie within the interval:

At startup:

  • Let interval be the length of the time interval to average over, and let n be the number of subintervals.
  • Let dt = interval / n.
  • Allocate an (n+1)-element buffer sum of values and an (n+1)-element buffer cnt of non-negative integers, and fill both with zeros.
  • Let prev have any value. (It doesn't really matter.)

When a new value new is received at time t:

  • Let i = floor(t / dt) mod (n+1).
  • If i ≠ prev:
    • Subtract sum[i] from total and cnt[i] from count.
    • Let sum[i] = 0, cnt[i] = 0, and let prev = i.
  • Add new to sum[i] and increment cnt[i] by one.
  • Add new to total and increment count by one.

When the average value is requested at time t:

  • Let i = floor(t / dt) mod (n+1).
  • If i ≠ prev:
    • Subtract sum[i] from total and cnt[i] from count.
    • Let sum[i] = 0, cnt[i] = 0, and let prev = i.
  • Let j = (i − n) mod (n+1) = (i+1) mod (n+1).
  • Let w = frac(t / dt) = (t / dt) − floor(t / dt).
  • Return (total − w × sum[j]) / (count − w × cnt[j]).
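
A sketch of that bucketed average in C#, following the steps above (the only liberty taken is that prev tracks the absolute bucket index rather than the wrapped one, which makes the staleness check a little more robust):

using System;

class BucketedAverage
{
    private readonly double dt;      // subinterval length = interval / n
    private readonly double[] sum;   // n+1 buckets of sums
    private readonly int[] cnt;      // n+1 buckets of counts
    private double total;
    private long count;
    private long prev = long.MinValue;

    public BucketedAverage(double intervalSeconds, int n)
    {
        dt = intervalSeconds / n;
        sum = new double[n + 1];
        cnt = new int[n + 1];
    }

    private int Bucket(double t, out long absolute)
    {
        absolute = (long)Math.Floor(t / dt);
        return (int)(absolute % sum.Length);
    }

    // If the bucket for the current time holds data from a full cycle ago, recycle it.
    private void Reclaim(double t)
    {
        int i = Bucket(t, out long abs);
        if (abs != prev)
        {
            total -= sum[i]; count -= cnt[i];
            sum[i] = 0; cnt[i] = 0;
            prev = abs;
        }
    }

    public void Add(double value, double t)   // t in seconds
    {
        Reclaim(t);
        int i = Bucket(t, out _);
        sum[i] += value; cnt[i]++;
        total += value; count++;
    }

    public double Average(double t)
    {
        Reclaim(t);
        int i = Bucket(t, out _);
        int j = (i + 1) % sum.Length;                // the oldest, partially expired bucket
        double w = t / dt - Math.Floor(t / dt);      // fraction of that bucket outside the window
        return (total - w * sum[j]) / (count - w * cnt[j]);
    }
}
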
Ilmari Karonen
1

Sadly, there isn't. The reason it's not possible is that you only want to account for values that are at most a second old, which means you have to re-calculate the result every time, which means huge loops.

If you wanted to calculate over the last 40,000 numbers, or over all of them, it would be easier, but because the window is time-based, you have to loop through the entire list every time.

DanRedux
  • You only need to keep the last second worth of data – David Heffernan Apr 23 '12 at 20:54
  • Just not true. See the other answers. – usr Apr 23 '12 at 20:57
  • Agreed, there are clever ways to look at this problem. – yamen Apr 23 '12 at 20:58
  • It's impossible to avoid looping, is all I'm saying. Even just keeping the min becomes an O(n) problem, not an O(1) problem, as when the current minimum is lost, you have to scan the ENTIRE buffer of 50,000 values for the new minimum. It's possible to make it efficient, sure, but if you are limiting by time instead of a static number, then it's impossible to avoid the loop completely. – DanRedux Apr 23 '12 at 21:07
  • You only need to scan the last second's worth of data. Same amount of data each time you read it. Therefore, O(1). – David Heffernan Apr 23 '12 at 21:12
  • Thanks DanRedux! Can you tell me what would be the algorithm if I would have static number of elements, for example always 40,000? – Dusan Apr 23 '12 at 21:13
  • David, that's the worst logic ever. It's still linear time, not constant, because to calculate the average you need to do a linear loop. Dusan, if you always have 40,000 records, it's not too hard. For the min/max, start with an empty list, and any time you find a value higher or equal to the highest value of the list, append it to the list, and any time you lose a value, check if it's in that list and if so, remove it. This lets you maintain a sort-of-sorted list that is a lot smaller. Do the same for min. For the average, you remove/add the impact of numbers as you lose/gain them. – DanRedux Apr 23 '12 at 21:28
1

Is there a way to do this without using array or list (buffer) to store arriving numbers and to calculate results?

No. It's provably impossible to do this without storing the information, as you stated it. You could tweak the requirements a bit to get rid of the need for a buffer though.

If I need to use buffer, what would be the efficient way to achieve this?

You'll want to use a Queue for this.

When an item is added, if it's the new max or min, adjust those variables accordingly. You can incrementally adjust the mean: take the new value, minus the old mean, divided by the new number of items in the set (i.e. the size of the queue plus one), and add that to the old mean.

Then you'll have something more or less like this:

while (queue.Count > 0 && queue.Peek().Time < oneSecondAgo)
{
    var oldItem = queue.Dequeue();
    if (oldItem.Value == min) { /* recalculate min */ }
    if (oldItem.Value == max) { /* recalculate max */ }
    mean += SubtractValueFromMean(oldItem.Value, queue.Count);
}

To remove the value from the mean you should be able to just use the same formula for adding, but use the negative of the value rather than the positive...I think. A better mathematician may need to help you out here.
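
For reference, here's the symmetric pair (my own derivation, so double-check it): adding a value to a mean over n items uses (value - mean) / (n + 1), and removing a value from a mean over n items uses (mean - value) / (n - 1), which is algebraically the same as (mean * n - value) / (n - 1).

static class RunningMean
{
    // Mean over n items, plus one new value => mean over n + 1 items.
    public static double Add(double mean, int n, double value)
        => mean + (value - mean) / (n + 1);

    // Mean over n items, minus one removed value => mean over n - 1 items.
    public static double Subtract(double mean, int n, double value)
        => n > 1 ? mean + (mean - value) / (n - 1) : 0.0;
}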

Servy
1

If the numbers are coming in one after the other, you can use a stopwatch and a while loop to get the numbers one by one for one second, calculating the min, max, and average as you go.

double min = double.MaxValue;
double max = double.MinValue;
double sum = 0;
int count = 0;
double avg;
Stopwatch sw = new Stopwatch();   // System.Diagnostics.Stopwatch
sw.Start();
while(sw.Elapsed.TotalSeconds <= 1)
{
   // Get the next number in the stream of numbers
   double d = GetNextNumber();

   // Calculate min
   if(d < min) min = d;
   // Calculate max
   if(d > max) max = d;

   // Accumulate for avg = sum / count
   sum += d;
   count++;
}

avg = sum/count;

Then return min, max, and avg.

Aaron Deming
1

It's not possible to do without holding on to the numbers in a buffer or queue.

The reason for that is simple: when a maximum value expires (falls out of the 1 second window), the new maximum is some other number that arrived within the last second, so you need to have a record of the candidates that could become the new maximum.

Needing the average means all values have an effect when they expire, and nothing can be thrown out before it's one second old.

Sam Holder's suggestion of using a queue is a good one, though you'll likely need a specialized one that can keep your list in two orders simultaneously: the order in which the numbers were received (arrival time), and ordered from maximum to minimum.

Using a single node object with two next and two previous pointers (one pair for temporal order, the other for size order) makes it possible to remove an element from both lists at once: when it expires from the temporal list, you already have the size-list pointers, because they live in the same node object.

The average can be maintained by keeping a running total and a running count, subtracting elements as they are removed, and adding them as they are created, so it is not necessary to iterate over the whole list each time in order to calculate the average.
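
A sketch of such a node and window (names are illustrative; the ordered insert into the size list is the linear-time part that the heaps discussed below would improve):

using System;

// One node participates in two doubly linked lists at once:
// one ordered by arrival time, one ordered by value.
class Node
{
    public double Value;
    public DateTime Time;
    public Node PrevByTime, NextByTime;   // temporal order
    public Node PrevBySize, NextBySize;   // size order (largest ... smallest)
}

class DualOrderWindow
{
    private Node timeHead, timeTail;      // oldest / newest by arrival
    private Node sizeHead, sizeTail;      // largest / smallest by value
    private double total;
    private int count;

    public double Max => sizeHead != null ? sizeHead.Value : double.NaN;
    public double Min => sizeTail != null ? sizeTail.Value : double.NaN;
    public double Average => count > 0 ? total / count : double.NaN;

    public void Add(double value, DateTime now)
    {
        var n = new Node { Value = value, Time = now };

        // Append to the temporal list.
        n.PrevByTime = timeTail;
        if (timeTail != null) timeTail.NextByTime = n; else timeHead = n;
        timeTail = n;

        // Ordered insert into the size list (largest first).
        Node p = sizeHead;
        while (p != null && p.Value >= value) p = p.NextBySize;
        n.NextBySize = p;
        n.PrevBySize = p != null ? p.PrevBySize : sizeTail;
        if (n.PrevBySize != null) n.PrevBySize.NextBySize = n; else sizeHead = n;
        if (p != null) p.PrevBySize = n; else sizeTail = n;

        total += value; count++;
        Expire(now - TimeSpan.FromSeconds(1));
    }

    public void Expire(DateTime cutoff)
    {
        while (timeHead != null && timeHead.Time < cutoff)
        {
            Node n = timeHead;

            // Unlink from the temporal list (n is its head).
            timeHead = n.NextByTime;
            if (timeHead != null) timeHead.PrevByTime = null; else timeTail = null;

            // The same node carries the size-list pointers, so no searching is needed.
            if (n.PrevBySize != null) n.PrevBySize.NextBySize = n.NextBySize; else sizeHead = n.NextBySize;
            if (n.NextBySize != null) n.NextBySize.PrevBySize = n.PrevBySize; else sizeTail = n.PrevBySize;

            total -= n.Value;
            count--;
        }
    }
}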

As btilly suggested in a comment on Sam Holder's answer, it would be more efficient to use a max heap and a min heap than a list. We would again use a single node with pointers for both heaps and the time-ordered list, so we don't have to search for elements in order to remove them; and it may take some care to remove elements that are not at the top of a heap while keeping the O(log n) guarantee for insertions and removals.

Theodore Murdock
0

For average, there are 3 cases to consider:

  1. Your numbers are integers. Keep a running total and count, add new values to the total, subtract old values from the total, and divide by the count as needed. This is simple because you don't have to worry about loss of precision.
  2. Your numbers are floating point and you require zero loss of precision: you'll have to iterate over the entire one-second list to compute an average.
  3. Your numbers are floating point and you can live with some loss of precision: operate as for the integer average, doing a complete recalculation every 1,000 values or so (a sketch follows this list).
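
A sketch of case 3: keep a running total for cheap averages, but periodically rebuild it from the buffered values so floating-point drift stays bounded (the rebuild period of 1,000 follows the answer; the rest is illustrative):

using System.Collections.Generic;

class DriftBoundedAverage
{
    private readonly Queue<double> window = new Queue<double>();   // the one-second buffer
    private double total;
    private int opsSinceRebuild;

    public void Add(double value)
    {
        window.Enqueue(value);
        total += value;
        MaybeRebuild();
    }

    public void RemoveOldest()
    {
        total -= window.Dequeue();
        MaybeRebuild();
    }

    public double Average => window.Count > 0 ? total / window.Count : 0.0;

    // Every ~1000 incremental updates, recompute the total exactly
    // so accumulated rounding error cannot grow without bound.
    private void MaybeRebuild()
    {
        if (++opsSinceRebuild < 1000) return;
        opsSinceRebuild = 0;
        total = 0;
        foreach (double v in window) total += v;
    }
}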

For min and max (only relevant for #1 and #3 above):

  • Keep the values in a treap indexed by value.
  • Also keep the values in a doubly linked list ordered by time. Save the beginning and end of the list.
  • New values are added at the beginning of the list; expired values are removed from the end.
  • For each new value: add it to the beginning of the time linked list, then remove expired values as needed from the end of the time linked list.

As you add and remove values to and from the linked list, perform corresponding operations on the treap. To get a min and max from the treap, just do find_minimum and find_maximum operations in log(n) time. As you remove things from the right end of the linked list in constant time, also remove them from the treap in log(n) time.

Treaps can find their minimum value in log(n) time, find their maximum value in log(n) time, and find an arbitrary value in log(n) time. In general, the more different ways you need to access your data, the better a well-rounded datastructure like a treap looks.

user1277476