
I'm working with a bunch of numpy arrays that don't all fit in RAM, so I need to periodically save them to and load them from the disk.

Usually, I know which ones I'll need to read ahead of time, so I'd like to hide the latency by issuing something like a "prefetch" instruction in advance.

How should I do this?


(There is a similar question related to TensorFlow. However, I am not using TensorFlow, so I wouldn't want to create a dependency on it.)

MWB
  • "almost all programming can be viewed as an exercise in caching" ... – Corley Brigman Mar 15 '16 at 01:25
  • @ali_m I/O is of whole arrays, no memcpy, but I'm flexible in my choices regarding the rest – MWB Mar 15 '16 at 02:12
  • Your question is still rather vague. Memory-mapped arrays (`numpy.memmap`) and HDF5 (PyTables, h5py) are two options you should probably consider, but you're going to have to get much more specific about your problem if you want a concrete answer. – ali_m Mar 15 '16 at 02:19
  • @ali_m I thought I answered your question. What's still vague? – MWB Mar 15 '16 at 02:23
  • @ali_m edited anyway -- hope this clarifies the question – MWB Mar 15 '16 at 02:28
  • If I interpret your question in the narrow sense of the title, you can "prefetch" the arrays just by reading them off the disk in the normal fashion at some point before they're needed. If they're `.npy` files you could call `np.load`; if they're pickles, you would open the files and unpickle them, etc. (not that I particularly recommend pickling for this purpose). You get to choose exactly how and when this happens, and there's no special magic required. This is essentially what's being done in the TensorFlow answer you linked to. Does that answer your question? – ali_m Mar 15 '16 at 02:41
  • @ali_m I don't think you'll get any latency hiding this way. You still have to wait, just earlier. I want to keep working in the main thread, while the "prefetch" causes another thread to do the I/O. (Without the complexity that direct multithreaded programming usually entails) – MWB Mar 15 '16 at 02:48
  • Well, as I said, you get to choose how and when this happens. When *should* a particular array be fetched from disk? Hopefully there are ways to predict which array(s) might be needed next, which would let you "prefetch" them and thereby hide the IO latency. However I don't know anything about your code, so I can't answer that question. – ali_m Mar 15 '16 at 02:55
  • @ali_m *"Hopefully there are ways to predict which array(s) might be needed next"* -- looks like you didn't even read the question, but you are complaining that it's vague. – MWB Mar 15 '16 at 03:05
  • Well then I really don't understand what the question is. If you know which data you need next, and you know how to read it off the disk, then you already know how to do prefetching. What else is there to say? – ali_m Mar 15 '16 at 03:11
  • I don't know of any ready-to-go capabilities; it's not complicated, but not necessarily easy either. You implement a thread-safe data structure: explicit accesses block, while prefetch accesses launch a separate thread to read from disk. It needs to comprehend LRU to throw out the oldest data when it has to make room for new data, and to ensure that an explicit access hitting an implicit read in progress works correctly. – Corley Brigman Mar 15 '16 at 14:41
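
For illustration, a minimal sketch of the thread-safe LRU structure described in the last comment above (the `PrefetchCache` class and its details are assumptions, not an existing library; it presumes `.npy` files loaded with `np.load`):

import collections
import concurrent.futures
import numpy as np

class PrefetchCache:
    """Sketch of an LRU cache: prefetch() starts a background read,
    get() blocks until the requested array is ready."""

    def __init__(self, max_items=8):
        self._futures = collections.OrderedDict()  # path -> Future, in LRU order
        self._max_items = max_items
        self._pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)

    def prefetch(self, path):
        # An explicit access that hits an in-flight read reuses the same
        # Future, so the two cannot race.
        if path not in self._futures:
            self._futures[path] = self._pool.submit(np.load, path)
            while len(self._futures) > self._max_items:
                self._futures.popitem(last=False)  # evict the least recently used

    def get(self, path):
        self.prefetch(path)                  # no-op if already cached or in flight
        self._futures.move_to_end(path)      # mark as most recently used
        return self._futures[path].result()  # blocks until the read completes

Typical use: call cache.prefetch(next_path) while working on the current array, then cache.get(next_path) when it is actually needed.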

3 Answers


If you're using Python 3.3+ on a UNIX-like system, you can use `os.posix_fadvise` to initiate a prefetch after opening a file. For example:

import os
import pickle

with open(filepath, 'rb') as f:
    # Ask the OS to start reading the whole file into the page cache
    os.posix_fadvise(f.fileno(), 0, os.stat(f.fileno()).st_size, os.POSIX_FADV_WILLNEED)

    # ... do other stuff ...

    # If you're lucky, the OS has asynchronously prefetched the file contents
    stuff = pickle.load(f)

Aside from that, Python doesn't directly offer any APIs for explicit prefetching, but you could use ctypes to manually load an OS-appropriate prefetch function, or use a background thread that does nothing but read and discard blocks from the file, to improve the odds that the data is in the system cache.
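
For illustration, a minimal sketch of the read-and-discard approach (the warm_cache helper name here is made up, not a standard API):

import threading

def warm_cache(filepath, chunk_size=1 << 20):
    """Read and discard the file in a background thread so that the OS
    page cache is (hopefully) warm by the time the real load happens."""
    def _read():
        with open(filepath, 'rb') as f:
            while f.read(chunk_size):
                pass
    t = threading.Thread(target=_read, daemon=True)
    t.start()
    return t

The later np.load/pickle.load should then mostly hit the page cache rather than the disk, though nothing guarantees the data stays cached under memory pressure.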

ShadowRanger

[Disclaimer: shameless self-advertising here :-)] I have written a library that should help with this, and it is compatible with Python 2.7: documentation/repository

You can use its `prefetch` function, which does what the name says: it preloads some values ahead of time:

import numpy as np
import seqtools

files = ['file1.npy', 'file2.npy', 'file3.npy']

def next_to_preload(current_idx):
    return (current_idx + 1) % 3

loaded = seqtools.smap(np.load, files)  # behaves like a list, but elements are computed on demand
preloaded = seqtools.prefetch(
    loaded,
    max_buffered=10,
    direction=(0, next_to_preload))

for i in range(3):
    print(preloaded[i])

It has a few more options, for instance if you want to switch from threads to processes.

Note that fetching an item other than the one provisioned by `next_to_preload` will reset the buffer.

pixelou

You can load a numpy array file (file_name.npy) in memory-mapped read mode. This will not bring the file into RAM; it only creates a reference in RAM that points at the array data on disk. You can iterate over the array just as if it were held in RAM, but the nice thing about loading the file in read mode is that the calculations and iteration will not blow up runtime memory.

import numpy as np

FILE_PATH = "path/file_name.npy"
# mmap_mode='r' keeps the data on disk; only the parts you touch are read
numpy_array = np.load(FILE_PATH, mmap_mode='r')
# append a new matrix of matching shape along axis=0
# (calculated_new_matrix is a placeholder for your computed result)
numpy_array = np.append(numpy_array, calculated_new_matrix, axis=0)
# save the result back to the same file path
np.save(FILE_PATH, numpy_array)

This saves runtime memory, and for large array files you can also run the numpy operations a batch at a time to keep the computational and memory costs manageable.
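
As an illustration of the batching idea (the file path and the processing step below are placeholders):

import numpy as np

arr = np.load("path/file_name.npy", mmap_mode='r')

batch_size = 1024  # rows per batch; tune to the available RAM
for start in range(0, arr.shape[0], batch_size):
    # np.asarray materializes just this slice in RAM
    batch = np.asarray(arr[start:start + batch_size])
    # ... process `batch` here ...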

YAP