I am generating a very large matrix of permutations of the numbers 0, 1, and 2. Ideally I want the rows to be 48 wide, but I would be happy even with 20. Right now I run out of memory, so I wanted to see whether a non-memory-based option is available.

I did try it on a large-memory machine, but I would prefer an option that uses disk (I have a fast SSD), even if it takes a long time. As you can see, I have already switched from int32 to int8 to save space, but that only went so far. A grid_size of 18 takes 6.5 GB and 19 takes 21 GB, so I understand that it grows exponentially and that I will reach a terabyte soon.
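
For reference, the sizes quoted above follow from the shape of the result: m**n rows of n int8 entries each. The helper below is a minimal sketch of that arithmetic; the name `grid_bytes` is hypothetical and not part of the original code.

```python
def grid_bytes(m: int, n: int, itemsize: int = 1) -> int:
    """Bytes needed for all m**n rows of n entries each (int8 -> itemsize 1)."""
    return (m ** n) * n * itemsize

# Print the footprint in GiB for a few widths mentioned in the question.
for n in (18, 19, 20, 24, 48):
    print(f"n={n}: {grid_bytes(3, n) / 2**30:,.1f} GiB")
```

By this estimate a width of 24 already needs several TiB uncompressed, and 48 is far beyond any disk, so the full grid can only be stored for the smaller widths.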

import numpy as np

class LargeGrid():
    def permgrid(self, m, n):
        inds = np.indices((m,) * n, dtype=np.int8)
        arr = inds.reshape(n, -1).T
        arr[arr == 2] = -1
        return arr

    def save(self):
        grid_size = 24
        grid = self.permgrid(3,grid_size)
        np.save('grid', grid)

lg = LargeGrid()
lg.save()

I expected a very large file on disk but ended up with:

Python 3.7.3 (default, Apr 24 2019, 15:29:51) [MSC v.1915 64 bit (AMD64)] on win32
MemoryError: Unable to allocate array with shape (24, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3) and data type int8
Ian Connor
  • `numpy` allows a max of 32 dimensions. – hpaulj Sep 24 '19 at 05:00
  • When you ask about an error like this, it's polite to provide the traceback. I went ahead and tested part of your code, and got the error in `np.indices`, where it tries to `res = np.empty(24,followed by 24 3's)`. It has to put all those indices somewhere. – hpaulj Sep 24 '19 at 05:53
  • With `sparse=True`, `np.indices` will return a tuple of 24 arrays, each with 3 elements and `ndim=24`. – hpaulj Sep 24 '19 at 05:58
  • You could try `for i in itertools.product([0,1,2], repeat=24): `. – hpaulj Sep 24 '19 at 06:05
  • @hpaulj - this was a great suggestion and has given me the idea to refactor - my approach was not ideal and using such a large matrix will not scale. – Ian Connor Sep 24 '19 at 10:34

1 Answer

The problem is the sheer size of the array. Instead, build smaller arrays and write them to an HDF5 (Hierarchical Data Format) file using h5py.
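
As a rough illustration of building the grid in smaller pieces, the sketch below yields consecutive blocks of rows without ever allocating the full array. The function name `iter_grid_chunks` and the `chunk_rows` parameter are hypothetical. It assumes m**n fits in a 64-bit integer (n up to 39 for m = 3) and applies the same 2 to -1 remapping as the question's `permgrid`.

```python
import numpy as np

def iter_grid_chunks(m, n, chunk_rows=100_000):
    """Yield the m**n rows of the grid in order, at most chunk_rows at a time."""
    total = m ** n
    # Place value of each column; the first column varies slowest, matching
    # np.indices((m,) * n).reshape(n, -1).T from the question.
    weights = m ** np.arange(n - 1, -1, -1, dtype=np.int64)
    for start in range(0, total, chunk_rows):
        stop = min(start + chunk_rows, total)
        idx = np.arange(start, stop, dtype=np.int64)[:, None]
        # Extract the base-m digits of each row number.
        chunk = ((idx // weights) % m).astype(np.int8)
        chunk[chunk == 2] = -1  # same remapping as the question's permgrid
        yield chunk
```

Each yielded block can be processed or written out and then discarded, so peak memory depends on `chunk_rows` rather than on the total number of rows.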

This lets you control how much memory the array in progress occupies while still storing the entire large array on the file system, in the HDF5 file (largefile.h5).
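
A minimal sketch of that idea with h5py, assuming the rows are produced lazily with `itertools.product` as suggested in the comments. The function name `write_grid_h5`, the dataset name `"grid"`, and the `chunk_rows` parameter are hypothetical choices for illustration, not part of the original code.

```python
import itertools

import h5py
import numpy as np

def write_grid_h5(path, n, values=(0, 1, -1), chunk_rows=100_000):
    """Write all len(values)**n rows to an HDF5 dataset, one block at a time."""
    total = len(values) ** n
    rows = itertools.product(values, repeat=n)  # lazy, nothing materialized yet
    with h5py.File(path, "w") as f:
        dset = f.create_dataset(
            "grid",
            shape=(total, n),
            dtype="i1",  # int8, as in the question
            chunks=(min(chunk_rows, total), n),
            compression="gzip",
        )
        start = 0
        while start < total:
            # Only chunk_rows rows are held in memory at any time.
            block = np.array(list(itertools.islice(rows, chunk_rows)), dtype=np.int8)
            dset[start:start + len(block)] = block
            start += len(block)
    return total
```

Calling `write_grid_h5("largefile.h5", 12)` would write a small grid for testing. The file on disk still grows with 3**n, so compression helps only to a degree, and building rows from Python tuples is slow for large n.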

Alternatively, you could use PyTables instead of h5py to store your data in HDF5 format.

References

I would encourage you to look at these:

h5py

  1. Input and output numpy arrays to h5py
  2. h5py Intro
  3. h5py Documentation

PyTables

  1. Python: how to store a numpy multidimensional array in PyTables?
  2. PyTables Documentation
  3. Example of storing numpy array using PyTables
marc_s
CypherX