
I'm creating an iterative algorithm (Monte Carlo method). The algorithm returns a value on every iteration, creating a stream of values.

I need to analyze these values and stop the algorithm when, say, the last 1000 returned values are within some epsilon.

I decided to implement it by calculating the max and min of the last 1000 values, computing the error with the formula (max-min)/min, and comparing it to epsilon: error <= epsilon. If this condition is reached, stop the iterations and return the result.

  1. The first hare-brained idea was to use a list, append new values to it, and calculate the max and min of its last 1000 values after each append.

  2. Then I decided there is no use in keeping more than the last 1000 values, so I remembered deque. It was a very good idea, since adding and deleting at both ends of a deque is O(1). But it didn't solve the problem of having to go through all of the last 1000 values on every iteration to calculate min and max. (A minimal sketch of this approach follows the list.)

  3. Then I remembered the heapq module. It keeps the data in such a way as to efficiently return the smallest element at any moment. But I need both the smallest and the largest. Furthermore, I need to preserve the order of the elements so that I can keep the last 1000 values returned by the algorithm, and I don't see how to achieve that with heapq.
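
For reference, here is a minimal sketch of the deque approach from item 2 (I use `generate_value()` as a stand-in for one iteration of my algorithm; the window size and epsilon are just example values):

from collections import deque

def run_until_stable(generate_value, window=1000, epsilon=1e-6):
    last = deque(maxlen=window)            # values older than `window` fall off the left end
    while True:
        value = generate_value()
        last.append(value)
        if len(last) == window:
            lo, hi = min(last), max(last)  # O(window) rescan on every iteration
            if lo > 0 and (hi - lo) / lo <= epsilon:   # assumes positive values
                return value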

Having all those thoughts in mind I decided to ask here:

How can I solve this task the most efficiently?

ovgolovin
  • @fredley The data should be monotonic. But the [Monte Carlo algorithm](http://en.wikipedia.org/wiki/Monte_Carlo_simulation) relies on random sampling, so it may have some surges. So I can't be completely sure it's monotonic. – ovgolovin Oct 24 '11 at 13:44

6 Answers

7

If you are free / willing to change your definition of error, you might want to consider using the variance instead of (max-min)/min.

You can compute the variance incrementally. True, using this method, you are not deleting any values from your stream -- the variance will depend on all the values. But so what? With enough values, the first few won't matter a great deal to the variance, and the variance of the average, variance/n, will become small when enough values cluster around some fixed value.

So, you can choose to halt when the variance/n < epsilon.
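
One standard way to do the incremental computation is Welford's online algorithm; here is a sketch (the names are placeholders, and `generate_value()` stands in for one iteration of your algorithm):

def run_until_converged(generate_value, epsilon=1e-9):
    # running count, mean, and sum of squared deviations from the mean (M2)
    n, mean, m2 = 0, 0.0, 0.0
    while True:
        x = generate_value()
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)       # note: uses the already-updated mean
        if n > 1:
            variance = m2 / (n - 1)    # sample variance of all values seen so far
            if variance / n < epsilon: # variance of the mean
                return mean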

unutbu
  • It would also be easy to remove the values dropping out of the last 1000 from the running variance in the same way you include the new ones, just with the opposite signs (see the sketch after these comments). Of course this would still require keeping a deque of the last 1000 values. – Sven Marnach Oct 24 '11 at 13:56
  • I like this answer! By getting rid of calculating the `min` and `max`, we have no need to rescan the last `1000` values. We can even calculate the variance of the last `1000` values (thanks to @SvenMarnach for the idea). I think I'll implement this algorithm. But I think I'll get rid of the `1000` values and will calculate the variance of all the values (because `1000` was off the top of my head; it could be `100` or `10000`, I don't know what number I should choose). – ovgolovin Oct 24 '11 at 14:08
  • @ovgolovin: Be careful with the approach of keeping all values. The later new values appear, the less weight they will have on the final result. This would lead to a much stricter convergence criterion if convergence appears later. – Sven Marnach Oct 24 '11 at 14:14
  • @SvenMarnach But how can I choose the right number of values to base the assessment on? (I chose 1000, but it was just a random choice with no calculation behind it.) **aix** gave a good idea of using an `exponentially-weighted` moving variance. I need some time to read and contemplate it. I think it's a very promising idea. – ovgolovin Oct 24 '11 at 14:25
  • @ovgolovin: aix's answer is the way to go. I was about to post a similar answer but got disrupted. – Sven Marnach Oct 24 '11 at 15:08
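
Following Sven Marnach's suggestion above, here is a sketch of a variance over a sliding window, keeping a running sum and sum of squares (names are mine; `generate_value()` stands in for one iteration; note that the sum-of-squares formula can lose precision if the values are large and tightly clustered):

from collections import deque

def windowed_variance_stream(generate_value, window=1000):
    values = deque()
    s = s2 = 0.0
    while True:
        x = generate_value()
        values.append(x)
        s += x
        s2 += x * x
        if len(values) > window:            # drop the value leaving the window
            old = values.popleft()
            s -= old
            s2 -= old * old
        n = len(values)
        if n == window:
            variance = (s2 - s * s / n) / (n - 1)
            yield x, variance               # caller stops when variance / n is small enough
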
6

As a refinement of @unutbu's excellent idea, you could consider using exponentially-weighted moving variance. It can be computed in O(1) time per observation, requires O(1) space, and has the advantage of automatically reducing the observation's weight as the observation gets older.

The following paper has the relevant formulae: link. See equations (140)-(143) therein.

Finally, you might want to work with the standard deviation instead of variance. It is simply the square root of variance, and has the advantage of having the same units as the original data. This should make it easier to formulate a meaningful stopping criterion.
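
In case it helps, here is a sketch of the usual exponentially-weighted update (the formulas should be checked against equations (140)-(143) of the paper; `alpha` is the smoothing weight and its value here is only an example):

def ewm_update(x, mean, variance, alpha=0.01):
    # one O(1) update of the exponentially-weighted mean and variance
    diff = x - mean
    incr = alpha * diff
    mean = mean + incr
    variance = (1.0 - alpha) * (variance + diff * incr)
    return mean, variance

# usage sketch: initialize mean to the first value and variance to 0.0, then on
# each iteration do  mean, variance = ewm_update(value, mean, variance)
# and stop when  variance ** 0.5 <= epsilon  (i.e. the standard deviation is small).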

NPE
  • It's a great idea. I need some time to read it and to contemplate. I may later ask some questions. Thank you! – ovgolovin Oct 24 '11 at 14:29
4

How about numpy?

Just to compare the speed:

import numpy as np
a = range(1000)
b = np.arange(1000)

max(a) # 29.7us
b.max() # 7.29us

and you can write to this array infinitely:

i = 0
b = np.empty([1000]) + np.nan   # an array of 1000 nan's

# inside your iteration loop:
b[i % 1000] = value
i += 1

The values older than 1000 iterations will get overwritten. You get the minimum/maximum with np.nanmin(b) and np.nanmax(b).

The idea behind the nan's is that you initialize this array with 1000 nan's and overwrite them one after another. The nanmin and nanmax functions ignore these nan's.
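
Put together, a sketch of the whole loop (again, `generate_value()` is just a stand-in for one iteration of your algorithm, and the `(max-min)/min` test assumes positive values):

import numpy as np

def run_until_stable(generate_value, window=1000, epsilon=1e-6):
    b = np.full(window, np.nan)          # circular buffer, initially all nan
    i = 0
    while True:
        value = generate_value()
        b[i % window] = value
        i += 1
        if i >= window:                  # buffer is full, no nan's left
            lo, hi = b.min(), b.max()    # nanmin/nanmax only needed before it fills
            if lo > 0 and (hi - lo) / lo <= epsilon:
                return value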

eumiro
  • There is also [np.ptp](http://docs.scipy.org/doc/numpy/reference/generated/numpy.ptp.html#numpy-ptp), which computes the peak-to-peak difference. – unutbu Oct 24 '11 at 13:57
  • Can you provide a comparison in terms of complexity (e.g. O(n), O(lg n) etc...)? – fredley Oct 24 '11 at 13:57
  • @unutbu - on a sample of 1000 elements, `ptp` is about the same speed as `min` and `max` combined. – eumiro Oct 24 '11 at 14:06
  • fredley: Complexity is the same as the deque, O(n) in each iteration where n=1000. It's “just” four times faster. – Petr Viktorin Oct 24 '11 at 14:10
  • @fredley - adding a new value is O(1) and is faster than with list/deque, since it overwrites an old value at the same time. Searching for min/max is O(n) and is faster than with a Python list. – eumiro Oct 24 '11 at 14:10
  • +1 I was about to write the same solution as you, but without numpy. – user712092 Oct 24 '11 at 19:10
3

I'm afraid I'm not in a position to provide a nice Python answer now, but I'll give you the outline of the data structure you'll need to use:

Keep your 1000 items in a FIFO queue. Keep pointers to the largest and smallest items in the queue. If one of these leaves the queue, search the queue for the new largest/smallest (amortized cost dependent on your data). If a new largest/smallest value enters the queue, just update the pointer (O(1)). Assuming your data is converging, this should work well.

fredley
  • "Pointer"? What is that? Maybe use the index into the deque, but then you have to remember to decrement the index every time you remove an item from the front. Then when one of the indexes goes to -1, it means it's time to recalc. Also, you have to have special case logic while building the initial 1000 values. – PaulMcG Oct 24 '11 at 13:54
  • @PaulMcGuire Sorry, I'm using generic data structure language rather than python language, I'm sure there's a nice way to make it work (`import pointers`?) – fredley Oct 24 '11 at 13:56
  • Hm. Thanks! Assuming the data is converging, the `max` value will be the one dropping out of the last `1000` values on every iteration, so there will be rescans on every iteration. – ovgolovin Oct 24 '11 at 13:59
  • @ovgolovin So, once every 1000 iterations, you incur an O(n) operation, where n = 1000. Amortization => O(1) amortized cost! – fredley Oct 24 '11 at 14:03
  • If the data is converging, then on each iteration the `max` or `min` value (depending on whether the variable is increasing or decreasing) will be the one dropping out of the last `1000` values. So we will need to rescan the whole `1000` values to update `max` or `min`, i.e. O(n) on each iteration. – ovgolovin Oct 24 '11 at 14:12
  • @ovgolovin, the word is *monotone*, not converging. – avakar Oct 24 '11 at 17:47
  • @avakar If the function is not monotone but still converges to some value (oscillating near it with decreasing amplitude over the iterations), the next value to drop out of the list will be the `max` or the `min` (depending on whether the function is decreasing or increasing at the moment). But both `max` and `min` require rescans when they are popped from the list. – ovgolovin Oct 24 '11 at 18:23
1

Create a subclass of deque that has minvalue and maxvalue properties. When adding or removing entries, compare them to the current min and max values - then you only need to rescan the deque for min/max if the value you are removing is the current min or max. When adding, just compare the new value with the current min and max, and update accordingly. This will optimize the scanning of your deque for min/max values.
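
A minimal sketch of that idea (`add` is just my name for the combined operation; eviction of the oldest value relies on deque's `maxlen`):

from collections import deque

class MinMaxDeque(deque):
    """Bounded deque that caches its min and max."""
    def __init__(self, maxlen=1000):
        super().__init__(maxlen=maxlen)       # deque evicts the oldest item itself
        self.minvalue = None
        self.maxvalue = None

    def add(self, value):
        evicted = self[0] if len(self) == self.maxlen else None
        self.append(value)                    # may push the oldest item out
        if evicted is not None and evicted in (self.minvalue, self.maxvalue):
            self.minvalue = min(self)         # rescan only when a cached extreme leaves
            self.maxvalue = max(self)
        else:
            if self.minvalue is None or value < self.minvalue:
                self.minvalue = value
            if self.maxvalue is None or value > self.maxvalue:
                self.maxvalue = value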

PaulMcG
  • Thank you. I thought of it. This technique will decrease the average number of needed rescans of the last 1000 values. I thought maybe there is some tree that can keep the data so as to efficiently return `min` and `max` while preserving the order of the elements, allowing efficient deleting and adding at both ends as in `deque`. – ovgolovin Oct 24 '11 at 13:56
  • If the data is close to monotonic, the maximum/minimum will tend to be at the beginning/end of your 1000-item window. So, you'll have to rescan almost every time anyway. – Petr Viktorin Oct 24 '11 at 14:07
1

You could use two Fibonacci heaps. Adding values is O(1), deleting is O(log(n)). In your question you already suggest the heapq module. I am not sure what kind of heap it provides, but a normal binary heap will also work very smoothly.

The problem that you can only extract the minimum from one heap, but not the maximum, can be solved by keeping two heaps. Since I do not know the heapq module, you might either be able to provide it with your own comparison function, or you can just use -value instead of value as the key for the second heap.
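
With the standard heapq module (a binary min-heap), one way to make this work over the last 1000 values is to store (value, index) pairs, negate the values in the second heap to get a max-heap, and lazily discard entries whose index has slid out of the window. A sketch (names are mine):

import heapq

def window_min_max(values, window=1000):
    # yields (min, max) of the last `window` values once the window is full;
    # stale heap entries are discarded lazily when they reach the top
    min_heap, max_heap = [], []
    for i, x in enumerate(values):
        heapq.heappush(min_heap, (x, i))
        heapq.heappush(max_heap, (-x, i))
        oldest = i - window + 1                     # smallest index still in the window
        while min_heap[0][1] < oldest:
            heapq.heappop(min_heap)
        while max_heap[0][1] < oldest:
            heapq.heappop(max_heap)
        if i >= window - 1:
            yield min_heap[0][0], -max_heap[0][0]

Each value is pushed and popped at most once per heap, so the amortized cost per iteration is O(log n).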

LiKao
  • Great idea with using 2 heaps. Only one question: How can I keep the order of the elements? Because I need to calculate the `min` and `max` values only of the last `1000` added elements. – ovgolovin Oct 24 '11 at 14:34
  • If you're using a finite size (as it seems) and the objects are small (which appears to be the case) just use a third copy in a queue object. It means you're tripling your memory requirements, but since a) memory is cheap, and b) you only need finite size, it seems to be a valid solution. – mklauber Oct 24 '11 at 14:41
  • @mklauber But if we use `deque`, we still need to update the heaps. I looked up [the complexity](http://en.wikipedia.org/wiki/Fibonacci_heap#Summary_of_running_times) of heap operations. Insertion is `O(1)`, deletion is `O(log(n))`. But before deleting we need to find the pointer to that value. What is the complexity of that operation? – ovgolovin Oct 24 '11 at 14:59