Is it possible to asynchronously transfer memory from/to the GPU with cupy (or chainer)?
I'm training a relatively small network on very large data that does not fit into GPU memory. The data has to stay in CPU memory and be fed to the GPU one minibatch at a time.
The memory transfer time is the dominant bottleneck of this application. I think asynchronous memory transfer would solve this problem, i.e. while one minibatch is being computed, the next minibatch is transferred to the GPU in the background.
I'm wondering whether this is possible with the cupy.cuda.Stream class, but I haven't figured out how.
I would appreciate any comments/advice.
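For concreteness, this is roughly the double-buffered loop I have in mind. train_on_batch and the batches list are stand-ins for my real training step and data, and I don't know yet whether the copy actually overlaps with the computation:

import numpy as np
import cupy as cp

def train_on_batch(x_gpu):
    # stand-in for the real minibatch computation
    return float((x_gpu * 2).sum())

# stand-in for the real dataset, kept in CPU memory
batches = [np.ones((1000, 1000), dtype=np.float32) for _ in range(4)]

copy_stream = cp.cuda.Stream(non_blocking=True)

def prefetch(batch_cpu):
    # start an asynchronous host-to-device copy on copy_stream
    batch_gpu = cp.empty(batch_cpu.shape, dtype=batch_cpu.dtype)
    batch_gpu.set(batch_cpu, stream=copy_stream)
    return batch_gpu

current = prefetch(batches[0])
copy_stream.synchronize()
for nxt in batches[1:]:
    next_gpu = prefetch(nxt)    # copy the next batch in the background...
    train_on_batch(current)     # ...while computing on the current one
    copy_stream.synchronize()   # make sure the prefetch has landed
    current = next_gpu
train_on_batch(current)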
EDIT: I thought the following code would make the memory transfers asynchronous, but it does not.
import numpy as np
import cupy as cp
a_cpu = np.ones((10000, 10000), dtype=np.float32)
b_cpu = np.ones((10000, 10000), dtype=np.float32)
a_stream = cp.cuda.Stream(non_blocking=True)
b_stream = cp.cuda.Stream(non_blocking=True)
a_gpu = cp.empty_like(a_cpu)
b_gpu = cp.empty_like(b_cpu)
a_gpu.set(a_cpu, stream=a_stream)
b_gpu.set(b_cpu, stream=b_stream)
# This should start before b_gpu.set() is finished.
a_gpu *= 2
nvvp shows that the memory transfers take place sequentially.
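My current guess is that the copies stay synchronous because a_cpu and b_cpu live in ordinary pageable memory; as far as I know, cudaMemcpyAsync can only overlap with computation when the host buffer is page-locked. If that's right, pinning the buffers with cp.cuda.alloc_pinned_memory should look roughly like this (pinned_array is my own helper, and I haven't verified yet that this makes nvvp show overlapping transfers):

import numpy as np
import cupy as cp

def pinned_array(array):
    # copy an array into freshly allocated page-locked (pinned) host memory;
    # async host-to-device copies need a pinned source to actually overlap
    mem = cp.cuda.alloc_pinned_memory(array.nbytes)
    ret = np.frombuffer(mem, array.dtype, array.size).reshape(array.shape)
    ret[...] = array
    return ret

a_cpu = pinned_array(np.ones((10000, 10000), dtype=np.float32))
b_cpu = pinned_array(np.ones((10000, 10000), dtype=np.float32))

a_stream = cp.cuda.Stream(non_blocking=True)
b_stream = cp.cuda.Stream(non_blocking=True)

a_gpu = cp.empty_like(a_cpu)
b_gpu = cp.empty_like(b_cpu)

a_gpu.set(a_cpu, stream=a_stream)
b_gpu.set(b_cpu, stream=b_stream)

# run the kernel on a_stream so it waits only for a's copy, not b's
with a_stream:
    a_gpu *= 2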