
I retrieve a large number (>100,000) of time series from a database. Each time series is a 2D list with 5 to 10 entries, and each entry holds 8 values:

single_time_series = [
    [  43, 1219, 1065,  929, 1233, 2604, 3101, 2196],
    [  70, 1148, 1041,  785, 1344, 2944, 3519, 3506],
    [  80, 1148,  976,  710, 1261, 2822, 3335, 3247],
    [ 103, 1236, 1090,  762, 1024, 2975, 3777, 3093],
    [ 120,  883,  937,  493, 1221, 4119, 5241, 5133],
    [ 143, 1110, 1089,  887, 1420, 2471, 2905, 2845],
]  # a time series with 6 entries, each entry represents one day

For further processing I want all of these individual time series together in one 3D NumPy array. But since the length of each series varies between 5 and 10 entries, I need to pad every time series that is shorter than 10 with zero-filled rows:

[
  [  43, 1219, 1065,  929, 1233, 2604, 3101, 2196],
  [  70, 1148, 1041,  785, 1344, 2944, 3519, 3506],
  [  80, 1148,  976,  710, 1261, 2822, 3335, 3247],
  [ 103, 1236, 1090,  762, 1024, 2975, 3777, 3093],
  [ 120,  883,  937,  493, 1221, 4119, 5241, 5133],
  [ 143, 1110, 1089,  887, 1420, 2471, 2905, 2845],
  [   0,    0,    0,    0,    0,    0,    0,    0],
  [   0,    0,    0,    0,    0,    0,    0,    0],
  [   0,    0,    0,    0,    0,    0,    0,    0],
  [   0,    0,    0,    0,    0,    0,    0,    0]
]

Currently I'm achieving this by iterating over each time series coming from the database, padding it, and appending it to the final NumPy array:

import numpy as np

MAX_SEQUENCE_LENGTH = 10
all_time_series = ... # retrieved from db

all_padded_time_series = np.array([], dtype=np.int64).reshape(0, MAX_SEQUENCE_LENGTH, 8) 

for single_time_series in all_time_series:
  single_time_series = np.array(single_time_series, dtype=np.int64)

  length_diff = MAX_SEQUENCE_LENGTH - single_time_series.shape[0]

  if length_diff > 0:
    single_time_series = np.pad(single_time_series, ((0, length_diff), (0,0)), mode='constant')

  all_padded_time_series = np.append(all_padded_time_series, [single_time_series], axis=0)

While the database request executes in a matter of seconds, the padding and appending operations take forever: the script needs ~45 minutes for ~100,000 time series on my iMac.

Since the database keeps growing, I will need to analyse even more data in the near future. So I'm looking for a faster way to convert the lists coming from the db to a NumPy array. I'm pretty sure there is a more efficient way to do this – any ideas?

csch
    Did you try `numpy.concatenate` instead? In my experience, `numpy.pad` is very slow for this kind of padding. – bantmen Oct 29 '17 at 14:28
  • Also relevant: https://stackoverflow.com/questions/32037893/numpy-fix-array-with-rows-of-different-lengths-by-filling-the-empty-elements-wi – bantmen Oct 29 '17 at 14:36
  • Repeated `np.append` is too slow. Better to copy each array into a 3d `zeros` array. Or adapt @Divakar's 2d solution. You already know the padded array shape. – hpaulj Oct 29 '17 at 16:03

2 Answers

import numpy as np
ll = [[43, 1219, 1065, 929, 1233, 2604, 3101, 2196],
      [70, 1148, 1041, 785, 1344, 2944, 3519, 3506],
      [80, 1148, 976, 710, 1261, 2822, 3335, 3247],
      [103, 1236, 1090, 762, 1024, 2975, 3777, 3093],
      [120, 883, 937, 493, 1221, 4119, 5241, 5133],
      [143, 1110, 1089, 887, 1420, 2471, 2905, 2845]] 
      # your input list of lists from a database

def a(l):
    # my solution for this problem: allocate the zero-filled
    # nd-array once, then copy the data into it; this avoids
    # the cost of re-creating the nd-array
    a = np.zeros((10, 8), dtype=np.int64)
    np.copyto(a[0:len(l), 0:8], l)
    return a

def b(l):
    # your (the question's) solution for this problem
    a = np.array(l, dtype=np.int64)
    len_diff = 10 - a.shape[0]
    return np.pad(a, ((0, len_diff), (0, 0)), mode='constant')

I wanted to profile and compare these two functions, but profiling doesn't work well (because of CPU caching).
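For a rough comparison one could still time them with timeit, which averages over many runs; a minimal sketch (the repeat count is arbitrary, and absolute numbers will vary by machine):

import timeit

# time each padding function on the sample list; timeit re-runs the
# snippet many times, which smooths out caching effects somewhat
t_copy = timeit.timeit(lambda: a(ll), number=100000)
t_pad  = timeit.timeit(lambda: b(ll), number=100000)
print("zeros+copyto: %.2fs  np.pad: %.2fs" % (t_copy, t_pad))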

naoki fujita
  • Looking at your (csch's) question again carefully, I find that we can create the 3d array all at once, because we just have to count the number of entries in all_time_series to define the shape of the 3d numpy array. Then we can copy all the data into the 3d array one by one. – naoki fujita Oct 29 '17 at 15:08
  • Or, if we don't need re-calculation, or the calculation doesn't depend on past results, I would try a lazy sequence (i.e. an iterable) or save the intermediate result for this purpose. – naoki fujita Oct 29 '17 at 15:14
  • Because the given input data has already been converted to a numpy array, that solution is forced to re-create the numpy array. In addition, the input is a numpy array of Python lists, not a numpy 2d array. I think that solution could be improved by avoiding the re-creation of the array. – naoki fujita Oct 29 '17 at 16:33

You have two major time consumers – np.append and np.pad. np.append creates a new array each time; it does not grow the array in place the way list.append grows a list. np.pad is OK, but more general than what you need, and thus slower.
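As an aside, even just collecting the padded arrays in a Python list and stacking once at the end avoids the repeated reallocation; a sketch of that variant (not the approach taken below):

padded = []
for single_time_series in all_time_series:
    arr = np.array(single_time_series, dtype=np.int64)
    length_diff = MAX_SEQUENCE_LENGTH - arr.shape[0]
    if length_diff > 0:
        arr = np.pad(arr, ((0, length_diff), (0, 0)), mode='constant')
    padded.append(arr)              # list.append is cheap, unlike np.append
all_padded_time_series = np.stack(padded)   # one allocation at the end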

Since you know the target dimensions, make a zero-filled array right away and then copy your lists into it:

all_padded_time_series = np.zeros((len(all_time_series), MAX_SEQUENCE_LENGTH, 8), dtype=np.int64)

for i, single_time_series in enumerate(all_time_series):
  single_time_series = np.array(single_time_series, dtype=np.int64)
  all_padded_time_series[i, :single_time_series.shape[0], :] = single_time_series   

or, letting the copy do the conversion to an array:

for i, single_time_series in enumerate(all_time_series):
  all_padded_time_series[i, :len(single_time_series), :] = single_time_series   

Comments link to a good solution by @Divakar. It copies all component arrays to the target at once using a mask. As written it assumes the components are 1d, but it could be adapted to this 2d case. But the logic is harder to understand and remember (even though I've recreated it several times).

itertools.zip_longest is also useful for joining lists of differing lengths, but it doesn't work nicely in this 2d case.
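For reference, a sketch of the 1d case where zip_longest does work well; in the 2d case each row is itself a list, so the fillvalue would have to be a whole zero row:

from itertools import zip_longest

lists_1d = [[1], [2, 2], [3, 3, 3]]
# zip_longest transposes and zero-fills; transpose back to get rows
padded = np.array(list(zip_longest(*lists_1d, fillvalue=0))).T
# array([[1, 0, 0],
#        [2, 2, 0],
#        [3, 3, 3]])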

In [269]: alist = [(np.ones((i,4),int)*i).tolist() for i in range(1,5)]
In [270]: alist
Out[270]: 
[[[1, 1, 1, 1]],
 [[2, 2, 2, 2], [2, 2, 2, 2]],
 [[3, 3, 3, 3], [3, 3, 3, 3], [3, 3, 3, 3]],
 [[4, 4, 4, 4], [4, 4, 4, 4], [4, 4, 4, 4], [4, 4, 4, 4]]]
In [271]: res = np.zeros((4,4,4),int)
In [272]: for i,x in enumerate(alist):
     ...:     res[i,:len(x),:] = x
     ...:     
In [273]: res
Out[273]: 
array([[[1, 1, 1, 1],
        [0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]],

       [[2, 2, 2, 2],
        [2, 2, 2, 2],
        [0, 0, 0, 0],
        [0, 0, 0, 0]],

       [[3, 3, 3, 3],
        [3, 3, 3, 3],
        [3, 3, 3, 3],
        [0, 0, 0, 0]],

       [[4, 4, 4, 4],
        [4, 4, 4, 4],
        [4, 4, 4, 4],
        [4, 4, 4, 4]]])

Adapting the approach from Numpy: Fix array with rows of different lengths by filling the empty elements with zeros (linked in the comments above):

The mask is calculated with:

In [291]: mask = np.arange(4)<np.array([len(x) for x in alist])[:,None]
In [292]: mask
Out[292]: 
array([[ True, False, False, False],
       [ True,  True, False, False],
       [ True,  True,  True, False],
       [ True,  True,  True,  True]], dtype=bool)

In effect it selects res[0,:1,:], res[1,:2,:], etc., which we can verify by looking at res from the above calculation:

In [293]: res[mask]
Out[293]: 
array([[1, 1, 1, 1],
       [2, 2, 2, 2],
       [2, 2, 2, 2],
       [3, 3, 3, 3],
       [3, 3, 3, 3],
       [3, 3, 3, 3],
       [4, 4, 4, 4],
       [4, 4, 4, 4],
       [4, 4, 4, 4],
       [4, 4, 4, 4]])

We can get the same 2d array by concatenating the list into one long 2d array:

In [294]: arr = np.concatenate(alist, axis=0)

And thus do the masked assignment with:

In [295]: res[mask] = arr

The mask calculation is harder to visualize and remember.
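Putting the pieces together, a self-contained sketch of the fully vectorized version; the function name pad_stack is mine, not from the linked answer:

def pad_stack(alist, max_len, width, dtype=np.int64):
    # stack variable-length 2d lists into one zero-padded 3d array
    res = np.zeros((len(alist), max_len, width), dtype=dtype)
    lengths = np.array([len(x) for x in alist])
    mask = np.arange(max_len) < lengths[:, None]   # (n, max_len) boolean
    res[mask] = np.concatenate(alist, axis=0)      # one bulk copy
    return res

all_padded_time_series = pad_stack(all_time_series, 10, 8)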

hpaulj
  • Nice solution; it would be interesting to profile this against the OP's solution and compare the timings, although yours should obviously be faster. – bantmen Oct 29 '17 at 17:13
  • I tried the zero-filled-array solution which cut it down to ~40 seconds – csch Oct 29 '17 at 19:30
  • The masking looks interesting too and might be useful for some other things I'm up to – thanks for the help :-) – csch Oct 29 '17 at 19:31