
Suppose that I am working with a very large array (e.g., ~45GB) and am trying to pass it through a function which only accepts numpy arrays. What is the best way to:

  1. Store this array given limited memory?
  2. Pass this stored array into a function that takes only numpy arrays?
    You could use generators instead of lists. Can you tell us where do you get the array from? –  Oct 27 '16 at 18:07
  • @kiran.koduru: This is NumPy. Both generators and lists should be avoided. Arrays are something completely different. – user2357112 Oct 27 '16 at 18:09
  • Have you tried pyTables like in http://stackoverflow.com/questions/1053928/very-large-matrices-using-python-and-numpy ? – geompalik Oct 27 '16 at 18:24
  • Well here's some background: I am trying to train a very large feature array for a hidden Markov model: 700 x (400 x 4122), where each 400x4122 mini-array is a sequence of observed samples across 400 time stamps with 4122 features. There are 700 such sequences in total, which amounts to ~45GB of memory when concatenated. My question is: how do you work with an array of this size? – Andy Oct 27 '16 at 18:35
  • My biggest concern with using things like PyTables (and maybe memmap) is: can you feed them directly to functions that work with numpy arrays? – Andy Oct 27 '16 at 18:37
  • can you post the usage of this hmmlearn package as if the entire thing could be loaded into memory? also what form is the data in currently? do you already have a single 45 GB file, or multiple smaller files? – Aaron Oct 27 '16 at 18:51

1 Answer


TL;DR: just try it...

I know nothing about hidden Markov models, but as far as numpy memmaps go, you may find it'll just work. I say this because np.memmap is a direct subclass of ndarray. That said, even the documentation states that it "does not quite fit the ndarray subclass" and suggests it is possible to create the mmap object yourself with mmap.mmap(...). IMAO, after looking at the numpy.memmap.__new__() function, there's not much more you could do to make it a drop-in replacement, in which case you'll have to take a look at the functions you want to use and figure out why memmap arrays are not playing nicely. If that happens, it may even be easier to alter those files (the library source) than to alter the way the mmap is applied.
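For illustration, here's a minimal sketch of the memmap approach; the file name, dtype, and load_sequence() helper are placeholders, and the shape is taken from the comments above:

import numpy as np

# create a disk-backed array: 700 sequences x 400 samples, 4122 features each
X = np.memmap("features.dat", dtype=np.float64, mode="w+",
              shape=(700 * 400, 4122))

# fill it in chunks so the whole thing never has to sit in RAM at once
for i in range(700):
    X[i * 400:(i + 1) * 400] = load_sequence(i)  # load_sequence() is hypothetical
X.flush()  # push pending writes to disk

# np.memmap subclasses ndarray, so most numpy code will accept it directly
print(isinstance(X, np.ndarray))  # True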

As a final note, when working directly from disk (even buffered), get ready for some slow computation times... I would suggest finding the appropriate source code and hacking in a progress indication for the computationally expensive portions. Incremental writeback can also save you from re-computing large portions of data if an error (or just a power outage) occurs.
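For instance, a rough sketch of what I mean by incremental writeback, using another memmap for the results; the output shape, chunk size, and expensive_step() are placeholders:

import numpy as np

results = np.memmap("partial_results.dat", dtype=np.float64,
                    mode="w+", shape=(700, 128))

for i in range(700):
    results[i] = expensive_step(i)  # expensive_step() is hypothetical
    if (i + 1) % 50 == 0:
        results.flush()  # finished rows survive a crash or power outage
        print("completed {} / 700".format(i + 1))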

Here's an example of how I might add progress reporting to GaussianHMM().fit():

additions are marked with comments

changes to hmmlearn\base.py:

class _BaseHMM(BaseEstimator):
    # ...
    def fit(self, X, lengths=None):
        # ...
        for iter in range(self.n_iter):
            stats = self._initialize_sufficient_statistics()
            curr_logprob = 0
            for i, j in iter_from_X_lengths(X, lengths, iter, self.n_iter): # tell our generator which iteration
                # ...
                pass

changes to hmmlearn\utils.py:

def iter_from_X_lengths(X, lengths, iteration, stop):  # added arguments: iteration, stop
    if lengths is None:
        yield 0, len(X)
        print("completion: 100%")  # added progress output
    else:
        length = len(lengths)  # used every loop so I copied it to a local var
        n_samples = X.shape[0]
        end = np.cumsum(lengths).astype(np.int32)
        start = end - lengths
        if end[-1] > n_samples:
            raise ValueError("more than {0:d} samples in lengths array {1!s}"
                             .format(n_samples, lengths))

        for i in range(length):
            yield start[i], end[i]
            # convert loop iterations to % completion (added)
            print("completion: {}%".format(
                int((float(iteration) / stop + float(i + 1) / (length * stop)) * 100)))
  • never give up on python for big data, I've used it to slog through some heavy image processing on 10's of TB of images in less than two days of processor time on my laptop :D (I was actually bottle-necked by the network storage connection speed) – Aaron Oct 27 '16 at 19:22
  • Wow. Thanks for this detailed answer Aaron. I'm trying this out as soon as I can. – Andy Oct 27 '16 at 21:42
  • FYI, I made a post a little earlier today about the hmmlearn part of the question. Here's the link to it: http://stackoverflow.com/questions/40294642/python-passing-multiple-large-sequences-through-hmmlearn – Andy Oct 27 '16 at 21:45
  • @Andy reading the source code to figure out how things are implemented is a great way to learn new python tricks. Generally most of the implementation is in python with only small segments that need high performance in c. even most of numpy is in native python (and is pretty well commented) – Aaron Oct 28 '16 at 13:40