I have two lists: the first represents times of observation and the second represents the observed values at those times. I am trying to find the maximum observed value and the corresponding time within rolling windows of various lengths. For example's sake, here are the two lists.

# observed values
linspeed = [280.0, 275.0, 300.0, 475.2, 360.1, 400.9, 215.3, 323.8, 289.7]

# times that correspond to observed values
time_count = [4.0, 6.0, 8.0, 8.0, 10.0, 10.0, 10.0, 14.0, 16.0]

# actual dataset is of size ~ 11,000

The missing times (ex: 3.0) correspond to an observed value of zero, whereas duplicate times correspond to multiple observations that were floored to the same time. Since my window will roll over time_count (ex: max value in the first 2 hours, the next 2 hours, the 2 hours after that; max value in the first 4 hours, the next 4 hours, ...), I plan to use an array-reshaping routine. However, it's important to set everything up properly beforehand, which entails finding the maximum value among observations that share a duplicate time. To solve this problem, I tried the code just below.

def list_duplicates(data_list):
    """Return the values that appear more than once in data_list."""
    seen = set()
    seen_add = seen.add
    seen_twice = set(x for x in data_list if x in seen or seen_add(x))
    return list(seen_twice)

# check for duplicate values
dups = list_duplicates(time_count)
print(dups)
>> [8.0, 10.0]

# get index of duplicates
for dup in dups:
    print(time_count.index(dup))
>> 2
>> 4

When checking for the index of the duplicates, it appears that this code only returns the index of the first occurrence of each duplicate value. I also tried using OrderedDict from the collections module for reasons of efficiency/speed, but dictionaries have a similar problem: given duplicate keys paired with distinct observation values, only one observation per key survives, since later assignments overwrite earlier ones.
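For instance, a minimal illustration with a plain dict built straight from the two example lists (an OrderedDict behaves the same way):

# duplicate keys 8.0 and 10.0 each keep only their last assigned value
print(dict(zip(time_count, linspeed)))
>> {4.0: 280.0, 6.0: 275.0, 8.0: 475.2, 10.0: 215.3, 14.0: 323.8, 16.0: 289.7}

Per this SO post, my second attempt is just below.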

for dup in dups:
    indexes = [i for i,x in enumerate(time_count) if x == dup]
print(indexes)
>> [4, 5, 6] # only the indices for duplicate time 10.0 appear; the print sits outside the loop, so indexes keeps just the last iteration's result

I should be getting [2, 3] for time_count = 8.0 and [4, 5, 6] for time_count = 10.0. Among the duplicates, 475.2 is the max linspeed for duplicate time 8.0 and 400.9 is the max linspeed for duplicate time 10.0, meaning the other linspeeds at the remaining indices of each duplicate group should be dropped.

I'm not sure what else to try. How can I adapt this (or find a new approach) to find all of the indices that correspond to duplicate values in an efficient manner? Any advice would be appreciated. (PS - I added numpy as a tag because I think there is a way to do this via numpy that I haven't figured out yet.)

2 Answers

OK, if you want to do this with numpy, it's best to turn both of your lists into arrays first:

import numpy as np

l = np.array(linspeed)
tc = np.array(time_count)

Now, finding unique times is just an np.unique call:

u, i, c = np.unique(tc, return_inverse = True, return_counts = True)

u
Out[]: array([  4.,   6.,   8.,  10.,  14.,  16.])

i
Out[]: array([0, 1, 2, 2, 3, 3, 3, 4, 5], dtype=int32)

c
Out[]: array([1, 1, 2, 3, 1, 1])

Now you can either build your maxima with a for loop, using the inverse index i to pick out each group's rows (a singleton group just returns its own value, so c isn't needed for the max itself):

m = np.array([np.max(l[i == j]) for j in range(u.size)])

m
Out[]: array([ 280. ,  275. ,  475.2,  400.9,  323.8,  289.7])

Or try some 2d method. This could be faster, but it builds a u.size × tc.size mask, so time and memory are quadratic, and it would need to be optimized. This is just the basic idea.

np.max(np.where(i[None, :] == np.arange(u.size)[:, None], l, 0), axis=1)
Out[]: array([ 280. ,  275. ,  475.2,  400.9,  323.8,  289.7])

Now your m and u vectors are the same length and include the output you want.
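If the quadratic cost matters, a linear-pass alternative is np.maximum.at (a standard ufunc method in core numpy); a minimal sketch with the arrays above:

m2 = np.full(u.size, -np.inf)  # seed each group slot below any observed value
np.maximum.at(m2, i, l)        # unbuffered in-place max, folding each l value into its group's slot

m2
Out[]: array([ 280. ,  275. ,  475.2,  400.9,  323.8,  289.7])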

Daniel F
  • And here I was playing with `np.argmax`... I will have a chance to test/play this in about 2 hours, thanks. –  May 05 '17 at 08:26
  • This solution is quadratic in both time and memory; given that timeseries such as described can get very large, it may not be the best solution. – Eelco Hoogendoorn May 08 '17 at 07:34
  • True, though it could possibly be optimized with sparse arrays. Depends on the dataset, what will be worthwhile - and a `for` loop will also be very slow if the timeseries is long. – Daniel F May 08 '17 at 07:44

Without going into the details of how to implement an efficient rolling-window-maximum filter: reducing the duplicate values can be seen as a grouping problem, for which the numpy_indexed package (disclaimer: I am its author) provides efficient and simple solutions:

import numpy_indexed as npi
unique_time, unique_speed = npi.group_by(time_count).max(linspeed)

For large input datasets (i.e., where it matters), this should be a lot faster than any non-vectorized solution. Memory consumption is linear and performance is O(N log N) in general; but since time_count appears to be sorted already, performance should be linear too.
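As the comments below note, group_by also exposes an argmax method; assuming it mirrors max's (keys, result) return convention, a sketch for recovering the winning indices into the original arrays would look like:

unique_time, max_idx = npi.group_by(time_count).argmax(linspeed)
# max_idx should index the original arrays, so the same positions can be
# reused on any parallel array, e.g. np.asarray(linspeed)[max_idx]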

Eelco Hoogendoorn
  • I have yet to familiarize myself with groupby, though the examples I've seen show how useful it is. The posted solution works for my large dataset, but I always appreciate learning alternative methods and tips; thanks! –  May 08 '17 at 07:59
  • I found the [associated package index](https://pypi.python.org/pypi/numpy-indexed) and the [associated github page](https://github.com/EelcoHoogendoorn/Numpy_arraysetops_EP), but it would really help if there were docs with example usage of various functions (similar to numpy/scipy docs). Is there any related documentation that I have not found yet? –  Nov 26 '17 at 10:06
  • I ask because I would like to know the indices of the original array (in this case, `linspeed`) that are used or omitted when finding the maximum value per duplicate such that the same indices can be used in other arrays. –  Nov 26 '17 at 10:14
  • group_by also has an argmax attribute. But indeed you are correct, nice docs would be helpful, but I haven't got around to it. I would be happy to review a PR though :). Also, the code is pretty clean and all the function-level docstrings are quite comprehensive, so searching the repo or doing go-to-definition in your IDE should work quite alright. – Eelco Hoogendoorn Nov 27 '17 at 09:44
  • I haven't learned how to do PRs yet, but I plan to within the next 3-4 months and can let you know when I do. Thanks btw, your tip about `argmax` being an attribute came in handy. –  Nov 29 '17 at 10:19