
I am trying to run genomic analysis pipelines using Hail (a library for genomic analysis written in Python and Scala). Apache Spark 3 was released recently, and it supports GPU usage.

I tried using the spark-rapids library to start an on-premise Slurm cluster with GPU nodes. I was able to initialise the cluster. However, when I tried running Hail tasks, the executors kept getting killed.

When I asked on the Hail forum, I got this response:

That’s a GPU code generator for Spark-SQL, and Hail doesn’t use any Spark-SQL interfaces, only the RDD interfaces.

So, does Spark 3 not support GPU usage for the RDD interfaces?

  • Apache Spark 3.0+ lets users provide a plugin that can replace the backend for SQL and DataFrame operations... Note that the plugin cannot accelerate operations that manipulate RDDs directly. See https://nvidia.github.io/spark-rapids/Getting-Started/#getting-started-with-the-rapids-accelerator-for-apache-spark for more information. – Nick Becker Sep 21 '21 at 17:58
  • Thanks for the quick reply. Is there any other library or plugin which can allow manipulating RDDs directly or allow GPU usage for RDD interfaces? – Abhishek Shakya Sep 21 '21 at 18:03
  • If you have a GPU-enabled Spark, maybe you only have to change your code from RDDs to DataFrames. https://stackoverflow.com/questions/29383578/how-to-convert-rdd-object-to-dataframe-in-spark – aironman Sep 21 '21 at 20:01

1 Answer


As of now, spark-rapids doesn't support GPU usage for RDD interfaces.

Source: Link

Apache Spark 3.0+ lets users provide a plugin that can replace the backend for SQL and DataFrame operations. This requires no API changes from the user. The plugin will replace SQL operations it supports with GPU accelerated versions. If an operation is not supported it will fall back to using the Spark CPU version. Note that the plugin cannot accelerate operations that manipulate RDDs directly.
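To make the quoted description more concrete, here is a minimal sketch of the settings that switch the plugin on and ask it to report what it could not move to the GPU. The configuration keys (`spark.plugins`, `spark.rapids.sql.enabled`, `spark.rapids.sql.explain`, and Spark's GPU resource amounts) come from the Spark and spark-rapids documentation as I understand it; the helper name `rapids_submit_args`, the jar path, and the default values are hypothetical and only for illustration.

```python
def rapids_submit_args(plugin_jar, gpus_per_executor=1, tasks_per_gpu=4):
    """Build spark-submit arguments that enable the RAPIDS SQL plugin.

    `plugin_jar` is a placeholder path to the plugin jar; the defaults are
    illustrative assumptions, not recommended values.
    """
    conf = {
        # Load the plugin that swaps supported SQL/DataFrame operators for GPU ones.
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        "spark.rapids.sql.enabled": "true",
        # Log the operators that stay on the CPU and why.
        "spark.rapids.sql.explain": "NOT_ON_GPU",
        # Standard Spark 3 GPU scheduling: GPUs per executor, GPU share per task.
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
        "spark.task.resource.gpu.amount": str(1 / tasks_per_gpu),
    }
    args = ["--jars", plugin_jar]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Placeholder jar path; replace with the real location on your cluster.
    print(" ".join(rapids_submit_args("/path/to/rapids-4-spark.jar")))
```

Even with these settings, only SQL/DataFrame plans are candidates for the GPU. A job that goes through the RDD API, as Hail does, would still run entirely on the CPU.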

Here is an answer from the spark-rapids team:

Source: Link

We do not support running the RDD API on GPUs at this time. We only support the SQL/Dataframe API, and even then only a subset of the operators. This is because we are translating individual Catalyst operators into GPU enabled equivalent operators. I would love to be able to support the RDD API, but that would require us to be able to take arbitrary java, scala, and python code and run it on the GPU. We are investigating ways to try to accomplish some of this, but right now it is very difficult to do. That is especially true for libraries like Hail, which use python as an API, but the data analysis is done in C/C++.
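The reasoning in that reply can be illustrated with a toy model. This is not the plugin's real code: the class names, the `GPU_SUPPORTED` set, and the operator names are all hypothetical. It only shows the idea that a planner can swap out named, declarative operators it recognises, while an arbitrary function passed to an RDD transformation is opaque to it.

```python
from dataclasses import dataclass
from typing import Callable, List, Union

# Hypothetical set of declarative operators with a GPU equivalent.
GPU_SUPPORTED = {"filter", "project", "hash_aggregate"}


@dataclass
class SqlOp:
    """A named operator the planner can inspect (stand-in for a Catalyst operator)."""
    name: str


@dataclass
class OpaqueFn:
    """Arbitrary user code, e.g. the lambda given to rdd.map(...)."""
    fn: Callable


def place(plan: List[Union[SqlOp, OpaqueFn]]) -> List[str]:
    """Return 'GPU' for steps with a known GPU equivalent, 'CPU' otherwise."""
    placements = []
    for step in plan:
        if isinstance(step, SqlOp) and step.name in GPU_SUPPORTED:
            placements.append("GPU")
        else:
            # Unsupported operators and opaque functions fall back to the CPU.
            placements.append("CPU")
    return placements


# A DataFrame-style plan: recognised operators move, the unknown one falls back.
dataframe_plan = [SqlOp("filter"), SqlOp("hash_aggregate"), SqlOp("some_unsupported_op")]
# An RDD-style plan: nothing but opaque functions, so nothing can be translated.
rdd_plan = [OpaqueFn(lambda row: row), OpaqueFn(lambda row: row)]

if __name__ == "__main__":
    print(place(dataframe_plan))
    print(place(rdd_plan))
```

In this sketch the DataFrame-style plan is split between GPU and CPU, while the RDD-style plan stays on the CPU from start to finish, which matches what the reply describes for libraries built on the RDD interfaces.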