I am trying to break a numpy array into chunks of a fixed size and pad the last chunk with zeros. For example: [1,2,3,4,5,6,7] into chunks of 3 should give [[1,2,3],[4,5,6],[7,0,0]].

The function I wrote is:

import numpy as np

def makechunk(lst, chunk):
    result = []
    for i in np.arange(0, len(lst), chunk):
        temp = lst[i:i + chunk]
        # Zero-pad the final chunk if it comes up short
        if len(temp) < chunk:
            temp = np.pad(temp, (0, chunk - len(temp)), 'constant')
        result.append(temp)
    return result
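
For the example above it returns the expected chunks, as a list of arrays:

result = makechunk(np.array([1, 2, 3, 4, 5, 6, 7]), 3)
# [array([1, 2, 3]), array([4, 5, 6]), array([7, 0, 0])]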

It works, but it is very slow when dealing with large arrays. What is a more numpy-ish, vectorized way of doing it?


4 Answers

Answer by Cedric Poulet (score 3)

Using the function resize() should do what you need:

import numpy as np

l = np.array([1, 2, 3, 4, 5, 6, 7])
l.resize((3, 3), refcheck=False)  # in-place; the extra cells are zero-filled
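
For the example above, resize pads the missing cells with zeros in place, so l should now print as:

print(l)
# [[1 2 3]
#  [4 5 6]
#  [7 0 0]]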

(Edit: mea culpa, Monday problem with reassignment.)

@J: resize boosts the speed by about 5 times for np.arange(0, 44100) in chunks of 512.

import math

import numpy as np

def makechunk4(lst, chunk):
    l = lst.copy()
    # In-place resize to (n_chunks, chunk); the new cells are zero-filled
    l.resize((math.ceil(l.shape[0] / chunk), chunk), refcheck=False)
    return l
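
A quick sanity check on the example from the question:

makechunk4(np.array([1, 2, 3, 4, 5, 6, 7]), 3)
# array([[1, 2, 3],
#        [4, 5, 6],
#        [7, 0, 0]])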
  • Don't assign it back. Unlike `numpy.resize`, `numpy.ndarray.resize` returns `None`, as it is an in-place operation. – Chris Apr 01 '19 at 09:43
  • This returns None though – J_yang Apr 01 '19 at 09:43
  • @J_yang I would be curious to know whether the resize function is efficient with a large array? – Cedric Poulet Apr 01 '19 at 09:49
  • 1
    @CedricPoulet well, a little late to the party (and not the fastest code typer from what I see) but you can find time measurements of your approach below. – Szymon Maszke Apr 01 '19 at 09:58
  • 1
    @CedricPoulet I modified and tested your code: import math def makechunk4(lst, chunk): l = lst.copy() l.resize((math.ceil(l.shape[0]/chunk),chunk), refcheck=False) l.reshape(l.shape[0] * l.shape[1]) return l It is now about 5 times faster with an array of 44100 into chunk of 512 blocks. Many thanks. You can modify your answer to the code above and I will select as the best answer. :) – J_yang Apr 01 '19 at 10:16
  • @CedricPoulet and @J_yang does it return the correct answer? It returns a one-dimensional array while you wanted to split it. Shouldn't `l = l.reshape(l.shape[0] * l.shape[1])` be `l = l.reshape(chunk, -1)`? It would return a `chunk x padded_elements` matrix this way. – Szymon Maszke Apr 01 '19 at 10:36
  • 1
    To be honest, it's not really clear what does J_yang want as an output. The reshape that he added flattern the matrix, I guess that's the output he was looking for (with the reshape, it could have just been an addition of zero at the end, so I think it's not really what he want..). But the problem was more about the way to split the data that the output. I let him judge what he need exactly. – Cedric Poulet Apr 01 '19 at 12:21
Answer by Szymon Maszke (score 3)

Time comparison of @Cedric Poulet's solution (all kudos to him, see his answer), with array splitting added so it returns the result as desired, against another numpy approach I first thought of (create an array of zeros and insert the data in place):

import math
import time

import numpy as np

def time_measure(func):
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        stop = time.time()
        print(f"Elapsed time: {stop-start}")
        return result

    return wrapper


@time_measure
def pad_and_chunk(array, chunk_size: int):
    # Copy the data into a zero-filled array of padded length, then split
    padded_array = np.zeros(len(array) + (chunk_size - len(array) % chunk_size))
    padded_array[: len(array)] = array
    return np.split(padded_array, len(padded_array) / chunk_size)


@time_measure
def resize(array, chunk_size: int):
    # Zero-pad in place via ndarray.resize, then split
    array.resize(len(array) + (chunk_size - len(array) % chunk_size), refcheck=False)
    return np.split(array, len(array) / chunk_size)


@time_measure
def makechunk4(l, chunk):
    # Zero-pad and chunk in place via ndarray.resize; reshape instead of split
    l.resize((math.ceil(l.shape[0] / chunk), chunk), refcheck=False)
    return l.reshape(chunk, -1)


if __name__ == "__main__":
    array = np.random.rand(1_000_000)

    ret = pad_and_chunk(array, 3)
    ret = resize(array, 3)
    ret = makechunk4(array, 3)

EDIT-EDIT

Gathering all the answers: it is indeed the case that np.split is horribly slow compared to reshape.

Elapsed time: 0.3276541233062744    (pad_and_chunk)
Elapsed time: 0.3169224262237549    (resize)
Elapsed time: 1.8835067749023438e-05    (makechunk4)

The way the data is padded is not the essential part; it is the split that takes up most of the time.
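
For reference, a minimal pad-and-reshape sketch that avoids np.split entirely (the name pad_and_reshape and the ceiling-division idiom are mine, not from the answers above):

def pad_and_reshape(array, chunk_size: int):
    # Ceiling division: number of chunks needed to hold all elements
    n_chunks = -(-len(array) // chunk_size)
    # Zero-filled target of the padded length, same dtype as the input
    padded = np.zeros(n_chunks * chunk_size, dtype=array.dtype)
    padded[: len(array)] = array
    # A cheap reshape instead of the costly np.split
    return padded.reshape(n_chunks, chunk_size)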

  • I am afraid the results are not the same; resize returns an array of 3 chunks. – J_yang Apr 01 '19 at 10:05
  • @J_yang are you downvoting because of a simple programming error? Wow, okay, here you go, fixed... – Szymon Maszke Apr 01 '19 at 10:11
  • I think np.split is not as fast as np.reshape; here is my edit based on Cedric's: `def makechunk4(lst, chunk): l = lst.copy(); l.resize((math.ceil(l.shape[0]/chunk), chunk), refcheck=False); l.reshape(l.shape[0] * l.shape[1]); return l`. Both return the same result, but %timeit shows 39us vs 207us for np.arange(0, 44100) in chunks of 512. Thanks – J_yang Apr 01 '19 at 10:26
  • @J_yang added this answer as well (fixed `reshape` and removed data copying) and you are right, it makes a tremendous difference. Please verify whether this `reshape` does what you want. – Szymon Maszke Apr 01 '19 at 10:41
Answer by hiro protagonist (score 0)

In the itertools recipes there is a recipe for grouper:

from itertools import zip_longest
import numpy as np

array = np.array([1,2,3,4,5,6,7])

def grouper(iterable, n, fillvalue=None):
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

res = list(grouper(array, 3, fillvalue=0))
# [(1, 2, 3), (4, 5, 6), (7, 0, 0)]

If you need the sublists to be lists and not tuples:

def grouper(iterable, n, fillvalue=None):
    args = [iter(iterable)] * n
    return (list(item) for item in zip_longest(*args, fillvalue=fillvalue))
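
Usage of the list-returning variant on the example array (the elements are numpy scalars, shown here as plain ints):

res = list(grouper(array, 3, fillvalue=0))
# [[1, 2, 3], [4, 5, 6], [7, 0, 0]]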
  • Hi, the grouper returns a generator object (`<itertools.zip_longest object at 0x1c1f42f390>`), which is much faster: 8us (for np.arange(0,44100) with n = 512) compared to 140us for my function. But to get the actual list of arrays I need to call list() on the result, and then it becomes much slower, at 3ms. Any suggestion? – J_yang Apr 01 '19 at 09:41
  • Calling the grouper does not do anything yet: `zip_longest` is lazy and starts evaluating only once you iterate over it, and then it will cost some time... I don't think there is anything you can do... – hiro protagonist Apr 01 '19 at 09:43
  • Unfortunately this is much slower. :( – J_yang Apr 01 '19 at 09:45
Answer by Thijs van Ede (score -2)

A solution using numpy

I assume a chunk size of 3 and create an example input array of length 10 in x.

import numpy as np

# Chunk size
chunk = 3
# Create array
x = np.arange(10)

First make sure to pad the array with zeros. Next you can use reshape to create an array of arrays.

# Pad array
x = np.pad(x, (0, chunk - (x.shape[0]%chunk)), 'constant')
# Divide into chunks
x = x.reshape(-1, chunk)

Optionally, you can convert the numpy array to a nested list:

x = x.tolist()
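
One caveat, as a hedged note on the padding step: if x.shape[0] is already an exact multiple of chunk, the expression chunk - (x.shape[0] % chunk) equals chunk, so a full extra row of zeros is appended. Using the negative modulo avoids that edge case:

# -x.shape[0] % chunk is 0 when the length is already a multiple of chunk
x = np.pad(x, (0, -x.shape[0] % chunk), 'constant')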
  • This returns ValueError: cannot reshape array of size 44168 into shape (512) – J_yang Apr 01 '19 at 09:44
  • But by padding the array beforehand instead of checking at each loop iteration, I managed to improve the time by about 5%. :) – J_yang Apr 01 '19 at 09:48