Consider the problem of creating a dataset by sampling random small image patches from a directory of high-resolution images. The TensorFlow Dataset API makes this very easy: construct a dataset of image filenames, shuffle it, map it to loaded images, and then map those to randomly cropped patches.
However, this naive implementation is very inefficient: a separate high-resolution image is loaded and decoded to produce each single patch. Ideally, an image would be loaded once and reused to generate many patches.
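The cost can be made concrete with a pure-Python sketch (not actual tf.data code; `load_image` and `random_patch` are hypothetical stand-ins for the expensive decode and the cheap crop): in the naive pipeline, every patch pays for a full image load.

```python
import random

load_count = 0  # tracks how often a full-resolution image is loaded

def load_image(name):
    # Hypothetical stand-in for reading and decoding a large image file.
    global load_count
    load_count += 1
    return f"pixels-of-{name}"

def random_patch(image):
    # Hypothetical stand-in for cropping a small random patch.
    return f"patch-from-{image}"

def naive_patch_dataset(names):
    # Mirrors filenames -> shuffle -> load -> crop: one full load per patch.
    names = list(names)
    while True:
        name = random.choice(names)
        yield random_patch(load_image(name))

patches = naive_patch_dataset([f"img{i}.png" for i in range(10)])
first_hundred = [next(patches) for _ in range(100)]
```

After drawing 100 patches, `load_count` is 100: the load/decode work scales with the number of patches, not the number of distinct images.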
One simple approach, discussed previously, is to generate multiple patches from each image and flatten them into the dataset. However, this biases the data too much: consecutive patches come from the same image, while we want each training batch to draw from many different images.
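The bias is easy to see in a toy generator (a sketch with invented names, not tf.data code): the first batch-sized run of elements all share a single source image.

```python
import random

def flattened_patch_dataset(names, patches_per_image=4):
    # Hypothetical sketch of the "multiple patches then flatten" approach:
    # each image is loaded once, but its patches come out consecutively.
    names = list(names)
    while True:
        name = random.choice(names)
        for k in range(patches_per_image):
            yield (name, k)  # consecutive yields share the same image

gen = flattened_patch_dataset(["a.png", "b.png"], patches_per_image=4)
batch = [next(gen) for _ in range(4)]  # a "batch" the size of the patch run
```

Here the first four elements necessarily come from one image, so any batch smaller than or equal to `patches_per_image` can be entirely single-image. A large shuffle buffer after flattening mitigates this, at the cost of memory.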
Ideally, what I would like is a "random caching filter" transformation that takes an underlying dataset and caches N of its elements in memory. Its iterator returns a random element from the cache, and with a pre-defined frequency it replaces a random cached element with a new one from the underlying dataset. This filter would allow faster data access at the expense of less randomization and higher memory consumption.
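The intended behavior can be prototyped in plain Python before committing to a C++ iterator; a minimal sketch (the class and parameter names are my own invention, not an existing API):

```python
import random

class RandomCacheIterator:
    """Sketch of the proposed "random caching filter".

    Wraps an underlying iterator, keeps a cache of `cache_size` elements,
    and on each step returns a random cached element. With probability
    `refresh_prob`, it first replaces a random cache slot with the next
    element from the underlying iterator.
    """

    def __init__(self, source, cache_size, refresh_prob=0.1, seed=None):
        self._source = source
        self._rng = random.Random(seed)
        self._refresh_prob = refresh_prob
        # Eagerly fill the cache with the first `cache_size` elements.
        self._cache = [next(source) for _ in range(cache_size)]

    def __iter__(self):
        return self

    def __next__(self):
        if self._rng.random() < self._refresh_prob:
            # Replace a random cached element with a fresh one; once the
            # source is exhausted, keep serving the existing cache.
            try:
                slot = self._rng.randrange(len(self._cache))
                self._cache[slot] = next(self._source)
            except StopIteration:
                pass
        return self._rng.choice(self._cache)

it = RandomCacheIterator(iter(range(1000)), cache_size=8,
                         refresh_prob=0.25, seed=0)
sample = [next(it) for _ in range(100)]
```

In the real pipeline the cached elements would be decoded images and each draw would be followed by a random crop; a Python prototype like this could even be plugged into a pipeline via `tf.data.Dataset.from_generator`, though a native C++ iterator would avoid the Python overhead.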
Is there such functionality available?
If not, should it be implemented as a new dataset transformation or simply as a new iterator? A new iterator seems to be all that is needed. Any pointers on how to create a new dataset iterator, ideally in C++?