We have a feature that lets users drag and drop modules in the UI to form a data processing pipeline — for example, reading data, preprocessing, and classification training. After they are arranged, these modules are executed sequentially.
Each module starts a container (via Kubernetes) to run in. The results produced by the previous module are saved to cephfs as a file, and the next module reads that file and then performs its operation. This serialization/deserialization step is slow. We plan to use RAPIDS to speed up the pipeline: improve inter-module data exchange by keeping the data in GPU memory, and use cuDF/cuML instead of Pandas/scikit-learn for faster processing.
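To make the bottleneck concrete, here is a minimal sketch of the current file-based handoff. The module names, the placeholder transformations, and the use of a temp directory in place of the real cephfs mount are all illustrative assumptions, not our actual code:

```python
import os
import tempfile

import pandas as pd

# Stands in for the shared cephfs mount (illustrative only).
SHARED_DIR = tempfile.mkdtemp()

def preprocess_module(in_path: str, out_path: str) -> None:
    # Each module runs in its own container: it must first deserialize
    # the previous module's output from the shared filesystem...
    df = pd.read_csv(in_path)
    df["x"] = df["x"] * 2              # placeholder preprocessing step
    # ...and then serialize its own result back to disk for the next module.
    df.to_csv(out_path, index=False)

def training_module(in_path: str) -> int:
    # The next module pays the deserialization cost again.
    df = pd.read_csv(in_path)
    return len(df)                     # placeholder for actual training

raw = os.path.join(SHARED_DIR, "raw.csv")
pre = os.path.join(SHARED_DIR, "preprocessed.csv")
pd.DataFrame({"x": [1, 2, 3]}).to_csv(raw, index=False)
preprocess_module(raw, pre)
n_rows = training_module(pre)
```

Every module boundary incurs a full write-to-disk and read-from-disk round trip; this is the cost we hope to avoid by keeping the intermediate data resident in GPU memory between modules.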
We have already confirmed that these modules can be ported from Pandas/scikit-learn to cuDF/cuML. However, because each module runs in its own container, the container — and its process — disappears as soon as the module finishes, so the corresponding cuDF data cannot stay resident in GPU memory between modules.
Given this setup, is there any good advice on how to use RAPIDS to improve this pipeline?