
I have a long list, let's call it y. len(y) = 500. I'm not including y in the code on purpose.

For each item in y, I want to find the average of the item and its 5 preceding values. I run into a problem when I get to the last item in the list, because one of the lines below needs 'a+1'.

a = 0
SMAlist = []
for each_item in y:
    if a > 4 and a < ((len(y))-1): # my averages begin at the 6th item
        b = (y[a-5:a+1]) # this line doesn't work for the last item in y
        SMAsix = round((sum(b)/6),2)
        SMAlist.append(SMAsix)
    if a > ((len(y))-2): # this line seems unnecessary. How can I avoid it?
        b = (y[-6:-1]+[y[a]]) # Should I just use negative values in general?
        SMAsix = round((sum(b)/6),2)
        SMAlist.append(SMAsix)
    a = a+1
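
Note: the slice y[a-5:a+1] is valid even when a is the last index, because a+1 == len(y) is a legal slice endpoint, so the second branch can be dropped entirely. A minimal sketch of the intended loop (with stand-in data, since y is deliberately not included):

y = [10406.19, 10995.72, 11162.55, 11256.7, 11634.98, 12174.25,
     13876.47, 18491.18, 16908, 15266.43]  # stand-in data; the real y has 500 items

SMAlist = []
for a in range(5, len(y)):      # averages begin at the 6th item
    b = y[a - 5:a + 1]          # the item at index a and its 5 predecessors
    SMAlist.append(round(sum(b) / 6, 2))
print(SMAlist)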
Vikas Periyadath

3 Answers


You chunkify your list and build averages over the chunks. The linked answer uses full chunks; I adapted it to build incremental ones:

Sliding avg via list comprehension:

# Inspiration for a "full" chunk I adapted: https://stackoverflow.com/a/312464/7505395
def overlappingChunks(l, n):
    """Yield overlapping n-sized chunks from l."""
    for i in range(0, len(l)):
        yield l[i:i + n]

somenums = [10406.19,10995.72,11162.55,11256.7,11634.98,12174.25,13876.47,
            18491.18,16908,15266.43]

# average over each (possibly shorter) sublist
slideAvg5 = [round(sum(part) / (len(part) * 1.0), 2)
             for part in overlappingChunks(somenums, 6)]

print(slideAvg5)

Output:

[11271.73, 11850.11, 13099.36, 14056.93, 14725.22, 15343.27, 16135.52, 
 16888.54, 16087.22, 15266.43]

I was going to partition the list via an incremental range(len(yourlist)) before averaging the partitions, but full partitioning was already solved here: How do you split a list into evenly sized chunks? I adapted that answer to yield incremental chunks and applied it to your problem; see the trailing-window sketch below for the question's exact "item plus its predecessors" variant.
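
Note that overlappingChunks looks forward from each index, while the question asks for each item together with the values before it. A mirrored generator (trailingChunks is a hypothetical name, not from the linked answer) is a one-line change:

def trailingChunks(l, n):
    """Yield, for each index i, l[i] together with up to n-1 of its
    predecessors (chunks are shorter near the start of the list)."""
    for i in range(len(l)):
        yield l[max(0, i - n + 1):i + 1]

nums = [1, 2, 3, 4, 5, 6, 7, 8]
print([round(sum(part) / (len(part) * 1.0), 2)
       for part in trailingChunks(nums, 6)])
# [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.5, 5.5]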


Which partitions are used for averaging?

explained = {(idx,tuple(part)): round(sum(part)/(len(part)*1.0),2) for idx,part in
             enumerate(overlappingChunks(somenums,6))}
import pprint
pprint.pprint(explained)

Output (reformatted):

# Input:
# [10406.19,10995.72,11162.55,11256.7,11634.98,12174.25,13876.47,18491.18,16908,15266.43]

# Index           partitioned part of the input list                         avg

{(0,     (10406.19, 10995.72, 11162.55, 11256.7, 11634.98, 12174.25))    : 11271.73,
 (1,     (10995.72, 11162.55, 11256.7, 11634.98, 12174.25, 13876.47))    : 11850.11,
 (2,     (11162.55, 11256.7, 11634.98, 12174.25, 13876.47, 18491.18))    : 13099.36,
 (3,     (11256.7, 11634.98, 12174.25, 13876.47, 18491.18, 16908))       : 14056.93,
 (4,     (11634.98, 12174.25, 13876.47, 18491.18, 16908, 15266.43))      : 14725.22,
 (5,     (12174.25, 13876.47, 18491.18, 16908, 15266.43))                : 15343.27,
 (6,     (13876.47, 18491.18, 16908, 15266.43))                          : 16135.52,
 (7,     (18491.18, 16908, 15266.43))                                    : 16888.54,
 (8,     (16908, 15266.43))                                              : 16087.22,
 (9,     (15266.43,))                                                    : 15266.43}
Patrick Artner

Option 1: Pandas

import pandas as pd

y = [10406.19,10995.72,11162.55,11256.7,11634.98,12174.25,13876.47,18491.18,16908,15266.43]
series = pd.Series(y)
print(series.rolling(window=6, center=True).mean().dropna().tolist())
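
The rolling windows above are centered; if you want strictly trailing windows as in the question (each item averaged with the 5 values before it), the default center=False already does that, and it produces the same five full-window averages, now aligned to each window's last element:

import pandas as pd

y = [10406.19, 10995.72, 11162.55, 11256.7, 11634.98, 12174.25, 13876.47,
     18491.18, 16908, 15266.43]
series = pd.Series(y)
# center=False (the default) anchors each window at its right edge
print(series.rolling(window=6).mean().dropna().tolist())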

Option 2: Numpy

import numpy as np
window = 6
s = np.insert(np.cumsum(np.array(y)), 0, [0])
output = (s[window:] - s[:-window]) * (1. / window)
print(list(output))

Output

[11271.731666666667, 11850.111666666666, 13099.355, 14056.930000000002, 14725.218333333332]
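
The NumPy version works because a prefix-sum array turns every window sum into one subtraction: with s[i] holding the sum of the first i elements, s[i + w] - s[i] == sum(y[i:i + w]). A tiny worked check:

import numpy as np

y = [1.0, 2.0, 3.0, 4.0, 5.0]
s = np.insert(np.cumsum(y), 0, 0)   # s[i] = sum of the first i elements
w = 3
# every length-w window sum in a single vectorized subtraction
print((s[w:] - s[:-w]) / w)         # [2. 3. 4.]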

Timings (these depend on the size of the data)

# Pandas
59.5 µs ± 8 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# Numpy
19 µs ± 4.38 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# @PatrickArtner's solution
16.1 µs ± 2.98 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Update

Check the timings code (run in a Jupyter notebook):

%%timeit
import pandas as pd

y = [10406.19,10995.72,11162.55,11256.7,11634.98,12174.25,13876.47,18491.18,16908,15266.43]
series = pd.Series(y)
series.rolling(window=6, center=True).mean().dropna().tolist()  # the operation being timed
Vivek Kalyanarangan
  • window should be 6? not that it matters - the idea counts :) +1 – Patrick Artner Feb 20 '18 at 07:43
  • Re your NumPy solution: Since `y` doesn't look zero-mean to me (and even if it were) you should first take the difference and then sum to avoid loss of significance. No big deal if it's only 500 numbers but why not do it properly if it costs nothing? – Paul Panzer Feb 20 '18 at 07:45
  • @PatrickArtner Should this `y = [10406.19,10995.72,11162.55,11256.7,11634.98,12174.25,13876.47,18491.18,16908,15266.43, 15266.43]` yield `[11271.731666666667, 11850.111666666666, 13099.355, 14056.930000000002, 14725.218333333332, 15330.459999999999]`? If yes, then I'm all set – Vivek Kalyanarangan Feb 20 '18 at 07:55
  • Copied your numbers into mine; mine looks different because I also average the "partial" partitions yours seems to discard - that's why I got more values than you, but the first ones match up (rounding aside). – Patrick Artner Feb 20 '18 at 08:13
  • Hmph! It's up to the OP to decide how he wants these boundary-value conditions to behave, I guess! But both are fine given the question... – Vivek Kalyanarangan Feb 20 '18 at 08:15
  • How do you get this cool measurement? timeit.timeit(...) only gives me the total duration... – Patrick Artner Feb 20 '18 at 08:26
  • @PatrickArtner I use a Jupyter notebook (see the edited answer)! Included your stats as well – Vivek Kalyanarangan Feb 20 '18 at 08:29

A little warning regarding @Vivek Kalyanarangan's "zipper" solution: for longer sequences it is vulnerable to loss of significance. Let's use float32 for clarity:

>>> y = (1000 + np.sin(np.arange(1000000))).astype(np.float32)
>>> window=6
>>> 
# naive zipper solution
>>> s=np.insert(np.cumsum(np.array(y)), 0, [0])
>>> output = (s[window :] - s[:-window]) * (1. / window)
# towards the end the result is clearly wrong
>>> print(output[-10:])
[1024. 1024. 1024. 1024. 1024. 1024. 1024. 1024. 1024. 1024.]
>>> 
# this can be alleviated by first taking the difference and then summing
>>> np.cumsum(np.r_[y[:window].sum(), y[window:]-y[:-window]])/window
array([1000.02936,  999.98285,  999.9521 , ..., 1000.0247 , 1000.05304,
       1000.0367 ], dtype=float32)
>>> 
# compare to last value calculated directly for reference
>>> np.mean(y[-6:])
1000.03217

To further reduce the error, one could chunk y and re-anchor the cumsum every so many terms without losing much speed; a sketch follows.
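
A minimal sketch of that anchoring idea (the function name and block size are my own choices, not an established API): the window sum is recomputed exactly from y at the start of every block, so rounding error cannot accumulate across the whole sequence:

import numpy as np

def anchored_rolling_mean(y, window=6, block=4096):
    """Rolling mean via the diff-then-cumsum trick, re-anchored every
    `block` outputs: each block restarts from an exactly recomputed
    window sum, so float error stays bounded by the block length."""
    y = np.asarray(y)
    n = len(y) - window + 1              # number of full windows
    out = np.empty(n, dtype=np.float64)
    d = y[window:] - y[:-window]         # change in the window sum per step
    for start in range(0, n, block):
        stop = min(start + block, n)
        anchor = y[start:start + window].sum(dtype=np.float64)  # exact restart
        out[start] = anchor
        out[start + 1:stop] = anchor + np.cumsum(d[start:stop - 1], dtype=np.float64)
    return out / window

y = (1000 + np.sin(np.arange(1000000))).astype(np.float32)
print(anchored_rolling_mean(y)[-1], np.mean(y[-6:]))  # should agree closely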

Paul Panzer