
I'd like to use scipy.ndimage.watershed_ift on an image that is much too big to fit into memory. Is my only option to split the image into tiles and process them individually? For that to work, I'd need to figure out how to deal with the edges of the tiles: they would need to overlap a bit, and I'd have to be smart about stitching them back together.
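Something like the sketch below is what I have in mind. The tile size, halo width, and `process_tile` function are placeholders, and it assumes the operation is local enough that a wide-enough halo makes each tile's core independent of the tiling:

```python
import numpy as np

def process_tiled(image, process_tile, tile=1024, halo=64):
    """Run process_tile on overlapping tiles and stitch the cores back together."""
    out = np.empty_like(image)
    h, w = image.shape
    for i in range(0, h, tile):
        for j in range(0, w, tile):
            # Tile bounds padded by the halo, clipped to the image edges
            i0, i1 = max(i - halo, 0), min(i + tile + halo, h)
            j0, j1 = max(j - halo, 0), min(j + tile + halo, w)
            result = process_tile(image[i0:i1, j0:j1])
            # Keep only the tile's core; the halo is thrown away
            out[i:i + tile, j:j + tile] = result[i - i0:i - i0 + tile,
                                                 j - j0:j - j0 + tile]
    return out
```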

Is there a generic approach to handing large arrays off to NumPy and SciPy functions?

ajwood
  • How big is the image? (Is it one image, or an image cube of several tens of GB?) Did you profile your code yet? Doing things like `a += b` instead of `c = a + b` saves a load of memory, and if memory is your bottleneck, that also speeds things up. – usethedeathstar Aug 29 '13 at 06:47
  • Damn, can't edit my previous comment; just thought of it: did you consider buying more RAM? Going from (for example) 4 GB to 16 GB can work wonders and doesn't cost that much. Or is the image too big even for that option? For more specific help we need more info on how you implement everything. – usethedeathstar Aug 29 '13 at 07:13
  • @usethedeathstar: Right now I'm trying to run `ndimage.distance_transform_edt` on a 30000x30000 image, and it runs me out of 12 GB of memory. As far as I can tell, I'm stuck being responsible for the tiling/processing/stitching... – ajwood Sep 03 '13 at 15:33

1 Answer


Yes, numpy.memmap is a generic approach for dealing with large arrays above your memory limits.

You can check this answer, or this other one, both explaining in more detail how to use `numpy.memmap`.
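As a minimal sketch of the idea (the file name, dtype, and shape here are just placeholders):

```python
import numpy as np

# Create a disk-backed array (use mode='r+' to open an existing file);
# nothing is read into RAM until a slice of it is actually touched
img = np.memmap('big_image.dat', dtype=np.float32, mode='w+',
                shape=(30000, 30000))

# Work in row blocks so only those pages get mapped into memory
for i in range(0, img.shape[0], 1000):
    img[i:i + 1000] *= 2.0

img.flush()  # push the changes back to disk
```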

Saullo G. P. Castro
  • Does this leave the responsibility to me for grabbing reasonably sized chunks? For example, could I pass a `memmap` or `pyhdf5` object off to `scipy.ndimage.label()` and expect it to do the right thing without running out of memory? – ajwood Aug 29 '13 at 12:47
  • @ajwood Yes, unless the process creates a copy of the array along the run. You can pass the whole `memmap` to the process and it will actually access the data from your hard drive. – Saullo G. P. Castro Aug 29 '13 at 12:52
  • I don't think the SciPy `watershed` and `label` functions will run on `memmap` buffers; I'm forced to always have at least one copy of the whole array in memory. I believe my only solution is tiling on my end and then stitching the tiles back together in a smart way (along the lines of the sketch after these comments)... – ajwood Sep 03 '13 at 15:17
  • @ajwood Have you monitored the memory usage during a run of `watershed` and `label`? I have never tried those functions myself... – Saullo G. P. Castro Sep 03 '13 at 15:58
  • Not those ones; I'm stuck on the distance transform so far, which in my algorithm comes before those functions. – ajwood Sep 03 '13 at 18:25
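For the distance transform specifically, here is a hedged sketch that reuses the hypothetical `process_tiled` helper from the question. The exact EDT is not a local operation, so the result is only correct where the true distance is smaller than the halo; the input and sizes are placeholders:

```python
import numpy as np
from scipy import ndimage

# Placeholder input: sparse foreground points on a large grid
points = np.random.rand(8192, 8192) < 1e-4

# Distance from every pixel to its nearest foreground point.
# The EDT runs on the complement of the point mask, and we cast to
# float so the distances survive the write-back inside process_tiled.
dist = process_tiled((~points).astype(np.float32),
                     ndimage.distance_transform_edt,
                     tile=2048, halo=256)
```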