
I am trying to calculate a mean over a large number of NumPy arrays. Originally, I tried:

data = (np.ones((10**6, 133))
        for _ in range(100))
np.stack(data).mean(axis=0)

but I was getting

numpy.core._exceptions.MemoryError: Unable to allocate xxx GiB for an array with shape (100, 1000000, 133) and data type float32

In the original code, `data` is a generator of more meaningful vectors.

I thought about using dask for such an operation, hoping it would split my data into chunks backed by disk.

import dask.array as da
import numpy as np

data = (np.ones((10**6, 133)) for _ in range(100))
x = da.stack(da.from_array(arr, chunks="auto") for arr in data)
x = da.mean(x, axis=0)
y = x.compute()

However, when I run it, the process terminates with "Killed".

How can I resolve this issue on a single machine?

dzieciou

2 Answers


You can avoid materializing all 100 arrays at once by accumulating a running sum:

# `data` is the generator of arrays from the question
agg_sum = np.zeros((10**6, 133))  # accumulator for the running sum
total = 100                       # number of arrays the generator yields

for dt in data:
    agg_sum = agg_sum + dt
_mean = agg_sum / total
MSS
    Great solution! Works faster than my solution and requires less disk space. Two things I would add: (1) `del dt` to avoid running out of memory, (2) replace `len(data)` with some `total` integer: the total number of arrays to average is known in advance, but you cannot call `len` on a generator. – dzieciou Aug 26 '21 at 06:35
    I also didn't know the shape of single array (`dt`) so I initialized `agg_sum = 0 ` and it works great as well. – dzieciou Aug 26 '21 at 06:52
    @dzieciou I know we can't take the `len` over generator. I just wanted to show the approach. – MSS Aug 26 '21 at 07:16
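Putting the comments together, a minimal self-contained sketch of the running-sum approach might look like the following. The toy shapes and the `streaming_mean` name are illustrative assumptions, not from the original answer; `agg_sum` starts at `0` so the array shape need not be known in advance, as suggested in the comments:

```python
import numpy as np

def streaming_mean(arrays):
    """Mean over a generator of equally shaped arrays, one array in memory at a time."""
    agg_sum = 0   # scalar start: the shape is picked up from the first array
    total = 0     # count arrays as they arrive, since len() fails on a generator
    for arr in arrays:
        agg_sum = agg_sum + arr
        total += 1
    return agg_sum / total

# toy data: 100 small arrays instead of the 100 x (10**6, 133) in the question
data = (np.full((4, 3), i, dtype="float64") for i in range(100))
result = streaming_mean(data)
print(result[0, 0])  # mean of 0..99 -> 49.5
```

Peak memory here is one input array plus the accumulator, regardless of how many arrays the generator yields.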

An alternative solution I found is to store all arrays in a disk-backed file, using `numpy.memmap`.

import numpy as np

# `data` is the generator of arrays from the question
total = 100
shape = (10 ** 6, 133)
c = np.memmap(
    "total.array", dtype="float64", mode="w+", shape=(total, *shape), order="C"
)
for idx, arr in enumerate(data):
    c[idx, :, :] = arr[:]
    del arr

c.mean(axis=0)

The important thing here is to `del arr`, so each array is freed immediately rather than exhausting memory before the garbage collector reclaims it.

Note that this solution requires around 100 GB of disk space, while the solution of @MSS requires much less space by keeping only the running sum.
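On toy sizes, the same pattern can be checked end to end. The shapes, the temporary directory, and the `toy.array` filename below are assumptions made for the sketch:

```python
import os
import tempfile
import numpy as np

total = 10
shape = (4, 3)
path = os.path.join(tempfile.mkdtemp(), "toy.array")

# disk-backed array holding all inputs; only the pages being touched stay in RAM
c = np.memmap(path, dtype="float64", mode="w+", shape=(total, *shape), order="C")

data = (np.full(shape, i, dtype="float64") for i in range(total))
for idx, arr in enumerate(data):
    c[idx, :, :] = arr[:]
    del arr  # free each array as soon as it has been written to disk

mean = c.mean(axis=0)
print(mean[0, 0])  # mean of 0..9 -> 4.5
```

The reduction itself still streams through the memmap, so memory use stays bounded even when the backing file is far larger than RAM.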

dzieciou