Faster way to compare thousands of values

Question

I have the following dataframe:

Price, Volume
100, 45656
101, 67563
103, 755
...
...
6543, 67567
6544, 7654

Each entry in the Price column is unique and there are several thousand rows. The goal is to identify the low volume prices over a rolling range of rows. In other words im not trying to identify the lowest volume in the whole dataframe. Im identifying many low volume rows over a rolling 'window'.

Lets say I set the rolling window to 50. What I do is then compare the current volume value to the 50 volume values above, and the 50 volume values below it. If the current volume value is the lowest within that range, I save the corresponding price to a separate list. I then move one row down and once again compare to see if the current volume value is smaller than the 50 above and below it.

My code below works correctly to accomplish this task:

rolling_window = 50
total_rows = len(df.index)
current_row = rolling_window
smallest_values = []

while current_row < total_rows - rolling_window:
    is_smallest = True
    for comparison_row in range(rolling_window):
        if vp.iloc[current_row]['Volume'] > vp.iloc[current_row -   comparison_row]['Volume'] or \
            vp.iloc[current_row]['Volume'] > vp.iloc[current_row + comparison_row]['Volume']:
            is_smallest = False
            break
    if is_smallest and vp.iloc[current_row]['Price'] not in smallest_values:
        smallest_values.append(vp.iloc[current_row]['Price'])
    current_row += 1

print(smallest_prices)

My problem is that it is extremely slow when dealing with large dataframes (several thousand items). Im sure there must be a better way of accomplishing what im trying to do that is more efficient. I fear im making the algorithm do more work than is necessary but I haven't been able to think of another way to do it.

I would be very grateful if anyone could suggest a faster/more efficient way of doing this.

How should you handle ties (same min volume for different prices within the window)? — Alexander, Mar 25 '15 at 08:15
@Alexander it should ideally save all the ties. ie: If there are a few prices within the window which all have the same volume, it should save all those prices — darkpool, Mar 25 '15 at 08:20

Ron Thompson · Answer 1 · 2015-03-25T08:32:18.077

2

Wouldn't it make more sense to skip 49 up (from the lowest) instead of just one? Because the next 49 values can't be lower than the one you just found, if it was the lowest.

Also, on another front, you might try to use an ordered map, since you say the prices are all unique. Then you could just look at one end of the map (depending on how it's sorted) to pull off the minimum key/value pairs. I'm assuming, of course, that the implementation of that map is well done, but if it's in your standard library, it probably is.

That way you could feed the list 100 values at a time into maps and have a heyday with it.

edited Mar 25 '15 at 08:32

answered Mar 25 '15 at 08:08

Ron Thompson

1,086
6
12

Thanks Ron. Yes you are correct I just checked this and it reduced the run time of the algorithm by about a third. Thank very much, I will update the original post. However it is still very slow. To process a few thousand entries takes several minutes. – darkpool Mar 25 '15 at 08:16
Ron, the only problem with this is in the case of ties (prices with equal volume) as Alexander mentioned. By doing this, the algorithm would save the first smallest volume, but then skip ahead potentially missing any ties, which is not desired. – darkpool Mar 25 '15 at 08:23
Another optimization I can think of is to jump by ~100 (and only store the minimum of each set of 100). That should reduce the number of comparisons by another factor of 2. With that method, you can also store the ties. Since before and after 50 is a set of 100 anyway. Just find the last minimum in the first set and jump (98?) away from that and compare the previous 99 values. Either way you could store the ties because you know the minimum, so you could jump from the last tie. – Ron Thompson Mar 25 '15 at 08:24
@darkpool: Updated answer with another idea. – Ron Thompson Mar 25 '15 at 08:32
Thanks Ron. That makes sense. I'll give it a go and see how things improve and let you know. – darkpool Mar 25 '15 at 08:34
@darkpool: Here's a link to another question on that, in case Alexander's solution doesn't work for you, (idk, it looks pretty good to me), http://stackoverflow.com/questions/9001509/how-can-i-sort-a-python-dictionary-sort-by-key – Ron Thompson Mar 25 '15 at 08:42

score 1 · Accepted Answer · answered Mar 25 '15 at 08:33

1

Step 1: implement a rolling min with 101 periods (50 up and 50 down from current point).

Step 2: Center these min values by shifting them down by 50.

Step 3: Compare the volume to the shifted min values. If they match, then that should be the price with the lowest volume within your window.

Step 4: Filter for matches.

Step 5: Enjoy your extra minutes of free time!

import pandas as pd
import bumpy as np

df = pd.DataFrame({'price': range(1000), 
                   'volume': np.random.random_integers(0, 500000, 1000)})
df['min_volume'] = pd.rolling_min(df.volume, 101)
df['min_shift'] = df['min_volume'].shift(-50)
df['match'] = df.volume == df.min_shift
>>> df[df.match]
Out[39]: 
     price  volume   min  min_shift match
181    181    4317  4317       4317  True
245    245    4478  4317       4478  True
358    358    1118  1118       1118  True
427    427    7251  1118       7251  True
504    504   10680  7251      10680  True
631    631    1096  1096       1096  True
699    699     277   277        277  True
770    770    2037   277       2037  True
828    828     310   310        310  True
931    931     516   516        516  True

To get just the prices:

df[df.match].price

answered Mar 25 '15 at 08:33

Alexander

105,104
32
201
196

It also saves all prices where there is a tie with the min value within a window. – Alexander Mar 25 '15 at 08:38
btw, my DataFrame has 'min', but the code snippet has 'min_volume'. I renamed the variable after originally running it to avoid any name clashes with the min DataFrame method. – Alexander Mar 25 '15 at 08:48
Excellent @Alexander. Thank you very much. This is some serious ninja code that I would never have been able to do myself. – darkpool Mar 25 '15 at 08:59
How much faster is it running? – Alexander Mar 25 '15 at 09:08
Well, after 5 minutes running my original code I just gave up and killed the process. Now it takes a few seconds depending on the size of the rolling window! – darkpool Mar 25 '15 at 09:24
Make sure to see Step 5! – Alexander Mar 25 '15 at 09:27

Faster way to compare thousands of values

2 Answers2