How to make a for-loop, which iterates over 15mln records, space-efficient?

Question

I have an issue with a simple for-loop. I'm trying to calculate maximum values from a list (i.e. list of shifting windows), create a list of those max values, which I'll later add to the data frame.

My data frame has two columns of float values, and datetime index. The data file has around 15 million rows (i.e. the length of the series that I want to iterate over is 15mln) (700 MB).

When I run my simple loop after some time my computer runs out of memory and crashes. I have 12 GB of RAM.

My code:

import pandas as pd
import numpy as np

# sample data
speed = np.random.uniform(0,25,15000000)

data_dict = {'speed': speed}
df = pd.DataFrame(data_dict)

# create a list of 'windows', i.e. subseries of the list 
def GetShiftingWindows(thelist, size):
    return [ thelist[x:x+size] for x in range( len(thelist) - size + 1 ) ]

window_size = 10

list_of_win_speeds = GetShiftingWindows(df.speed, window_size)

list_of_max_speeds = []

for x in list_of_win_speeds: 
    max_value = max(x)
    list_of_max_speeds.append(max_value)

I'm not a CS major. It seems to me like a space-complexity issue. What am I missing here to make it feasible to compute?

Paweł Kordowski · Answer 1 · 2017-03-20T07:42:09.957

As a first step I would change

return [ thelist[x:x+size] for x in range( len(thelist) - size + 1 ) ]

into

return ( thelist[x:x+size] for x in range( len(thelist) - size + 1 ) )

then you will end up with a generator, your code is creating the whole list of sublists in memory, the generator approach will produce only one sublist per for iteration

if you use Python 2 you can also change range (generates the whole list at once) into xrange (again generator producing only one value per call)

finally you can return a generator of iterators using islice:

from itertools import islice

and

return ( islice(thelist, x, x + size) for x in range( len(thelist) - size + 1 ) )

score 1 · Accepted Answer · edited May 23 '17 at 12:32

First of all you should be using the pandas aggregation functions rather than trying to iterate through the list and doing it on your own. It's not clear what exactly this function is supposed to do:

def GetShiftingWindows(thelist, size):
    return [ thelist[x:x+size] for x in range( len(thelist) - size + 1 ) ]

but what it does do is to create a very large dictionary. Consider investing in yield. When you use yield you are not storing this large dictionary in memory.

def GetShiftingWindows(thelist, size):
    for x in range( len(thelist) - size + 1 ):
        yield thelist[x:x+size]

and you can use xrange() instead of range() to squeeze out another few bytes.

The advantage of yield and xrange is that it does not store a list in memory. Instead produce a lazily evaluated iterable that has a smaller memory requirement.

How to make a for-loop, which iterates over 15mln records, space-efficient?

2 Answers2