
question

I wrote a small Python batch processor that loads binary data, performs numpy operations, and stores the results. It consumes much more memory than it should. I have looked at similar Stack Overflow discussions and would like to ask for further recommendations.

background

I convert spectral data to RGB. The spectral data is stored in a Band Interleaved by Line (BIL) image file, which is why I read and process the data line by line. I read the data using the Spectral Python library, which returns numpy arrays. hyp is a descriptor of a large spectral file: hyp.ncols = 1600, hyp.nrows = 3430, hyp.nbands = 160.
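
For context, a minimal sketch of how such a descriptor is typically obtained with Spectral Python; the file name is hypothetical:

import spectral

# Hypothetical ENVI header for the BIL file described above.
hyp = spectral.open_image('scene.hdr')
# hyp.nrows == 3430, hyp.ncols == 1600, hyp.nbands == 160
# hyp.bands.centers holds the band centre wavelengths used for the interpolation below.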

code

import spectral
import numpy as np
import scipy.interpolate

class CIE_converter(object):
    def __init__(self, cie):
        self.cie = cie

    def interpolateBand_to_cie_range(self, hyp, hyp_line):
        # Resample one spectral line from the sensor band centres to the CIE wavelengths.
        interp = scipy.interpolate.interp1d(hyp.bands.centers, hyp_line,
                                            kind='cubic', bounds_error=False, fill_value=0)
        return interp(self.cie[:, 0])

    #@profile
    def spectrum2xyz(self, hyp):
        out = np.zeros((hyp.ncols, hyp.nrows, 3))
        spec_line = hyp.read_subregion((0, 1), (0, hyp.ncols)).squeeze()
        spec_line_int = self.interpolateBand_to_cie_range(hyp, spec_line)
        for ii in xrange(hyp.nrows):
            spec_line = hyp.read_subregion((ii, ii + 1), (0, hyp.ncols)).squeeze()
            spec_line_int = self.interpolateBand_to_cie_range(hyp, spec_line)
            out[:, ii, :] = np.dot(spec_line_int, self.cie[:, 1:4])
        return out
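
For completeness, a hedged sketch of how this class might be driven. The file names and the layout of the CIE table are assumptions: column 0 is taken to hold the wavelengths and columns 1-3 the colour-matching weights, matching the indexing in the methods above.

import numpy as np
import spectral

# Hypothetical inputs: a CSV colour-matching table and an ENVI header file.
cie = np.loadtxt('cie_1931.csv', delimiter=',')   # assumed shape: (n_wavelengths, 4)
hyp = spectral.open_image('scene.hdr')

converter = CIE_converter(cie)
xyz = converter.spectrum2xyz(hyp)                 # shape (hyp.ncols, hyp.nrows, 3)
np.save('scene_xyz.npy', xyz)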

memory consumption

All the big data is initialised outside the loop, so my naive interpretation was that the memory consumption should not increase (have I used too much Matlab?). Can someone explain the factor-of-10 increase to me? The increase is not linear, as hyp.nrows = 3430. Are there any recommendations to improve the memory management?

  Line #    Mem usage    Increment   Line Contents
  ================================================
  76                                 @profile
  77     60.53 MB      0.00 MB       def spectrum2xyz(self, hyp):
  78    186.14 MB    125.61 MB           out = np.zeros((hyp.ncols,hyp.nrows,3))
  79    186.64 MB      0.50 MB           spec_line = hyp.read_subregion((0,1), (0,hyp.ncols)).squeeze()
  80    199.50 MB     12.86 MB           spec_line_int = self.interpolateBand_to_cie_range(hyp, spec_line)
  81                             
  82   2253.93 MB   2054.43 MB           for ii in xrange(hyp.nrows):
  83   2254.41 MB      0.49 MB               spec_line = hyp.read_subregion((ii,ii+1), (0,hyp.ncols)).squeeze()
  84   2255.64 MB      1.22 MB               spec_line_int = self.interpolateBand_to_cie_range(hyp, spec_line)
  85   2235.08 MB    -20.55 MB               out[:,ii,:] = np.dot(spec_line_int,self.cie[:,1:4])
  86   2235.08 MB      0.00 MB           return out

notes

I replaced range with xrange without a drastic improvement. I am aware that cubic interpolation is not the fastest, but this is not about CPU consumption.

  • what is shape and dtype of `spec_line_int`? – jfs Nov 15 '12 at 12:55
  • What OS are you using? On Linux, you need to know if the memory reported includes [buffers and cache](http://blog.scoutapp.com/articles/2010/01/11/free-memory-on-linux-free-m-vs-proc-meminfo) otherwise the numbers can be misleading. – unutbu Nov 15 '12 at 13:15
  • You can save some CPU time and reduce heap fragmentation by using output parameter in numpy functions (e.g. `np.dot(spec_line_int, self.cie[:,1:4], out=out[:,ii,:])`) to avoid allocation/deallocation of temporary numpy arrays. (I don't really believe it'll solve your problem, but who knows?) – atzz Nov 15 '12 at 14:19
  • According to the documentation, the `spectral` library does not read the data until it is really accessed: http://spectralpython.sourceforge.net/fileio.html#module-spectral.io.spyfile. This may explain why memory consumption is increasing while processing the data. – btel Nov 15 '12 at 14:51
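
To illustrate the output-parameter idea from the third comment, a minimal sketch under one caveat: np.dot requires its out array to be C-contiguous, so this version writes into a reusable scratch buffer rather than the transposed slice out[:, ii, :]. The array shapes are assumptions based on the question.

import numpy as np

ncols, nrows, n_cie = 1600, 3430, 95            # n_cie (number of CIE wavelengths) is assumed
cie_weights = np.ones((n_cie, 3))               # stands in for self.cie[:, 1:4]
spec_line_int = np.ones((ncols, n_cie))         # stands in for one interpolated line
out = np.zeros((ncols, nrows, 3))
scratch = np.empty((ncols, 3))                  # C-contiguous buffer reused on every iteration

for ii in range(nrows):
    np.dot(spec_line_int, cie_weights, out=scratch)   # result written in place, no temporary allocated
    out[:, ii, :] = scratch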

1 Answer

Thanks for the comments. They all helped me to reduce the memory consumption a little. But eventually I figured out the main reason for the memory consumption:

Spectral Python images contain a numpy memmap object, which has the same layout as the hyperspectral data cube on disk (for the BIL format: (nrows, nbands, ncols)). When calling:

spec_line = hyp.read_subregion((ii,ii+1), (0,hyp.ncols)).squeeze()

the data is not only returned as a numpy array, but is also cached in hyp.memmap. A second call would be faster, but in my case the memory just keeps increasing until the OS complains. As the memmap is actually a great implementation, I will take direct advantage of it in future work.
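
A minimal sketch of what taking direct advantage of the memmap could look like, assuming the memmap attribute is available and follows the BIL source layout (nrows, nbands, ncols) described above; the file name is hypothetical:

import numpy as np
import spectral

hyp = spectral.open_image('scene.hdr')            # hypothetical header file
ii = 0                                            # any row index
# One row taken straight from the memmap has shape (nbands, ncols);
# transposing gives (ncols, nbands), like the squeezed read_subregion result.
spec_line = np.asarray(hyp.memmap[ii, :, :]).T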

  • As of [spectral](http://spectralpython.net) version 0.15.0, all file read methods (such as `read_subregion` in your question) accept an optional `use_memmap` argument, which can be set to False to avoid using the memmap interface. This will cause the method to make direct file reads instead, which will often be slower than the memmap but will reduce memory consumption. – bogatron Jun 16 '14 at 21:21
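
Building on the comment above, a hedged sketch of the same loop with the memmap bypassed (requires spectral >= 0.15.0 per the comment; the file name is hypothetical):

import spectral

hyp = spectral.open_image('scene.hdr')
for ii in range(hyp.nrows):
    # use_memmap=False forces a direct file read instead of going through the cached memmap.
    spec_line = hyp.read_subregion((ii, ii + 1), (0, hyp.ncols), use_memmap=False).squeeze()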