Assume we generate our own training data (for example, by sampling from some diffusion process and computing quantities of interest on it), and that we have our own CUDA routine, generate_data, which produces labels in GPU memory for a given set of inputs.
We are therefore in a somewhat special setting where we can generate as many batches of training data as we want, in an "online" fashion: at each training iteration we call generate_data to produce a fresh batch and discard the previous one.
Since the data is already generated on the GPU, is there a way to make TensorFlow (the Python API) use it directly during training, for example to fill a placeholder? That would keep such a pipeline efficient.
My understanding is that, in such a setup, you currently have to copy the data from GPU to CPU and then let TensorFlow copy it back from CPU to GPU, which is rather wasteful since these copies are in principle unnecessary.
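For concreteness, here is roughly what I mean (the model, shapes and optimizer below are made-up placeholders; generate_data is the routine described above):

```python
import tensorflow as tf

batch_size, n_features = 256, 32

x_ph = tf.placeholder(tf.float32, shape=(batch_size, n_features))
y_ph = tf.placeholder(tf.float32, shape=(batch_size, 1))
w = tf.Variable(tf.zeros((n_features, 1)))
loss = tf.reduce_mean(tf.square(tf.matmul(x_ph, w) - y_ph))  # dummy model
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(1000):
        x_dev, y_dev = generate_data(batch_size)   # data lives in GPU memory
        x_host = x_dev.copy_to_host()              # copy #1: GPU -> CPU
        y_host = y_dev.copy_to_host()
        sess.run(train_op, feed_dict={x_ph: x_host,   # copy #2: CPU -> GPU,
                                      y_ph: y_host})  # done inside TensorFlow
```

The copy_to_host calls and the feed_dict transfer are exactly the two copies I would like to avoid.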
EDIT: if it helps, we can assume that the CUDA routine is implemented using Numba's CUDA JIT compiler.
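For reference, a toy stand-in for generate_data could look like the following (the real computation is more involved, but the output is the same kind of object: Numba device arrays living in GPU memory):

```python
import math
import numpy as np
from numba import cuda

@cuda.jit
def _compute_labels(x, y):
    # Toy "quantity of interest": the Euclidean norm of each input row.
    i = cuda.grid(1)
    if i < x.shape[0]:
        s = 0.0
        for j in range(x.shape[1]):
            s += x[i, j] * x[i, j]
        y[i, 0] = math.sqrt(s)

def generate_data(batch_size, n_features=32):
    # In the real code the inputs come from a diffusion process; here we
    # just push random numbers to the device for illustration.
    x_dev = cuda.to_device(np.random.randn(batch_size, n_features).astype(np.float32))
    y_dev = cuda.device_array((batch_size, 1), dtype=np.float32)
    threads = 128
    blocks = (batch_size + threads - 1) // threads
    _compute_labels[blocks, threads](x_dev, y_dev)
    return x_dev, y_dev  # both stay in GPU memory
```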