Determine High Frequency Intervals

Question

I'm trying to extend a function I wrote to find the first 3 values of a dictionary that are "close enough" are also all below a threshold (here N = 70). :

d = {
1: {0: 222, 2:44, 18: 44, 20: 22, 21:72, 105:22, 107:9, 115: 66},
2: {0: 61.0, 993: 65.0, 1133: 84.0, 1069: 48.0, 105:22, 107:9, 115: 24, 214:22, 206:9,       225: 241,412: 83.0, 364: 68.0, 682: 64.0, 172: 58.0} 
}#nested dictionary

def ff(d):
   G = []
   for k,v in sorted(d.iteritems()):
      G.append((k,v))
       #print G
   for i in range(len(G)-2):
      if (G[i+2][0] - G[i][0] < 20) & (G[i][1] <= 70) & (G[i+1][1] <=70) & (G[i+2][1]<=70):
          return i, G[i], G[i+1], G[i+2]

for idnum, ds in sorted(d.iteritems()):
    print ff(ds)

Output:

[(0, 222), (2, 44), (18, 44), (20, 22), (21, 72), (105, 22), (107, 9), (115, 66)]
(1, (2, 44), (18, 44), (20, 22))
[(0, 61.0), (105, 22), (107, 9), (115, 24), (172, 58.0), (206, 9), (214, 22), (225, 241),       (364, 68.0), (412, 83.0), (682, 64.0), (993, 65.0), (1069, 48.0), (1133, 84.0)]
(1, (105, 22), (107, 9), (115, 24)) #first interval fitting criteria

What I'd like to do, is actually find ALL windows of length 20 and keep track of how many values it has <=70. Any thoughts on how to get started would be great. I can't seem to figure out how to move from the condition using "i":

if (G[i+2][0] - G[i][0] < 20) & (G[i][1] <= 70) & (G[i+1][1] <=70) & (G[i+2][1]<=70):

to something based on the length 20 and not the indexing??

Ultimately, instead of "the first three" i'd like to keep track of all higher frequency with the minimum being "at least 3 of value <=70, consecutive in order* and in length 20 interval".

Desired Ouput:

If we have

   d[3] = {0: 61.0, 993: 65.0, 1133: 84.0, 1069: 48.0, 105:22, 107:9, 115: 24, 117:22, 200:100, 225: 241,412: 83.0, 420: 68.0, 423: 64.0, 430: 58.0}

Would result in the output:

[(105, 22), (107, 9), (115, 24),(117,22)], [(420, 68.0),(423,63),(430,58)] 
# These can be of any length as long as the overall interval of the list is <=20.

Is it OK to skip a value > 70? So, if instead of `(0, 222)` you had `(9, 222)`, would `(2, 44), (18, 44), (20, 22)` still be correct? — Lauritz V. Thaulow, Mar 10 '12 at 17:36
What if there are window candidates which overlap? (0,55),(1,55),(2,55),(3,55): the first three, the last three, and the four together each seem to satisfy the condition. What should happen in that case? — DSM, Mar 10 '12 at 19:09
@DSM, this is exactly my issue. I'd like to be able to identify these and still group them together and have a count of "4" attached to them. I'm working on a new function that I'll post soon but this is my sticking point. — Lillian Milagros Carrasquillo, Mar 10 '12 at 19:11

score 1 · Accepted Answer · edited May 23 '17 at 11:48

Here's something which might help you get started. It's loop-based, and doesn't even use zip (much less itertools.takewhile!), but hopefully will make sense:

def find_windows(d, min_elements=3,upper_length=20,max_value=70):
    G = sorted(d.items())
    for start_index in range(len(G)):
        for width in range(min_elements, len(G)-start_index+1):
            window = G[start_index:start_index+width]
            if not all(v <= max_value for k,v in window):
                break
            if not window[-1][0] - window[0][0] < upper_length:
                break
            yield window

I used "break" because as soon as we have any value > max_value or we're >= upper_length there are no more possible windows starting at start_index.

If you haven't seen yield before, it makes the function into a generator function; it's like a return where the function sends back (yields) the value and then can continue instead of stopping. (See the answers to this question for more information.)

>>> Ds = {
...     1: {0: 222, 2:44, 18: 44, 20: 22, 21:72, 105:22, 107:9, 115: 66},
...     2: {0: 61.0, 993: 65.0, 1133: 84.0, 1069: 48.0, 105:22, 107:9, 115: 24, 214:22, 206:9, 225: 241,412: 83.0, 364: 68.0, 682: 64.0, 172: 58.0} 
...     }
>>> 
>>> for idnum, d in sorted(Ds.items()):
...     print idnum, list(find_windows(d))
... 
1 [[(2, 44), (18, 44), (20, 22)], [(105, 22), (107, 9), (115, 66)]]
2 [[(105, 22), (107, 9), (115, 24)]]
>>> mydict = dict([(0,55),(1,55),(2,55),(3,55)])
>>> 
>>> for window in find_windows(mydict):
...     print window
... 
[(0, 55), (1, 55), (2, 55)]
[(0, 55), (1, 55), (2, 55), (3, 55)]
[(1, 55), (2, 55), (3, 55)]
>>> list(find_windows(mydict))
[[(0, 55), (1, 55), (2, 55)], [(0, 55), (1, 55), (2, 55), (3, 55)], [(1, 55), (2, 55), (3, 55)]]

It's still not entirely clear to me what you want to do with overlapping windows, but currently it finds them all and you can decide either within the function or in post-processing how you want to handle that.

Modifying the code to not test for whether all the values are <= max_value and count them instead should be trivial, so I'll leave that alone.

Thank you for the "yield" info, I haven't used this yet. I'll look into it as I try to build this function. I've edited the desired out put above. Basically, I want to have intervals of length 20. These intervals can have as many values as needed to capture all of those x-values that are recorded in that interval with a y-value <=70. I hope this makes sense... — Lillian Milagros Carrasquillo, Mar 10 '12 at 23:33

Lauritz V. Thaulow · Answer 2 · 2012-03-10T22:43:14.533

I've broken the problem into two parts. This first generator breaks a dictionary like your ds into ordered (key, value) lists such that each list has no values > 70. At the same time I discard chunks that have less than 3 items in them.

def split_iter(d, limit=70):
    G = list(sorted(d.iteritems()))
    start = 0
    for i, (k, v) in enumerate(G):
        if v > limit:
            if i - start >= 3:
                yield G[start:i]
            start = i + 1
    G_tail = G[start:]
    if len(G_tail) >= 3:
        yield G_tail

Now I'll use this together with bisect_right from the bisect module to quickly find the largest possible window starting with each item:

from bisect import bisect_right

def ff(d):
    for chunk in split_iter(d):
        last_end_i = 0
        for i, (k, v) in enumerate(chunk):
            end_i = bisect_right(chunk, (k + 20, 0))
            if end_i - i < 3:
                continue
            if last_end_i != end_i:
                yield chunk[i:end_i]
                last_end_i = end_i
            if end_i == len(chunk):
                break

As you see I only yield the biggest possible window. Now we put it together like this:

for idnum, ds in sorted(d.iteritems()):
    for r in ff(ds):
        print idnum, repr(r)

Hope I got it right. Output is like this:

1 [(2, 44), (18, 44), (20, 22)]
1 [(105, 22), (107, 9), (115, 66)]
2 [(105, 22), (107, 9), (115, 24)]

This is a slightly different way than I was thinking about it but VERY useful. Thank you for using Bisect, that is a new function for me and I'm glad to know it now. Your explanation is very good, by the way. — Lillian Milagros Carrasquillo, Mar 10 '12 at 23:53

Determine High Frequency Intervals

2 Answers2