
I just set up a Spark cluster on Google Cloud using Dataproc, and I have a standalone installation of Cassandra running on a separate VM. I would like to install the DataStax spark-cassandra-connector so I can connect to Cassandra from Spark. How can I do this?

The connector can be downloaded here:

https://github.com/datastax/spark-cassandra-connector

The instructions on building are here: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/12_building_and_artifacts.md

sbt is needed to build it.

Where can I find sbt for the Dataproc installation?

Would it be under $SPARK_HOME/bin? Where is Spark installed on Dataproc?

Ismail
femibyte
  • Does the connector need to be installed on the entire cluster, or could it be used via spark packages (which admittedly require a bit of a hack to use on Dataproc)? If packages are sufficient, consider using the 'short answer' on this question: http://stackoverflow.com/questions/33363189/use-an-external-library-in-pyspark-job-in-a-spark-cluster-from-google-dataproc – Angus Davis Dec 29 '15 at 20:52

2 Answers


I'm going to follow up on the really helpful comment @angus-davis made not too long ago.

Where can I find sbt for the DataProc installation ?

At present, sbt is not included on Cloud Dataproc clusters. The sbt documentation explains how to install it manually. If you need sbt on your clusters, I highly recommend creating an initialization action that installs sbt when the cluster is created. After some research, it looks like sbt is covered under a BSD-3 license, which means we can probably (no promises) include it in Cloud Dataproc clusters.

Would it be under $SPARK_HOME/bin? Where is Spark installed on Dataproc?

It depends on what you mean:

  • binaries - /usr/bin
  • config - /etc/spark/conf
  • spark_home - /usr/lib/spark

Importantly, this same pattern is used for other major OSS components installed on Cloud Dataproc clusters, like Hadoop and Hive.
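A quick way to confirm this layout on a cluster node (a sketch; the exact contents will vary by Dataproc image version):

```shell
# Inspect the Spark installation on a Dataproc node.
ls /usr/bin | grep spark      # spark-shell, spark-submit, ...
ls /etc/spark/conf            # spark-defaults.conf, spark-env.sh, ...
readlink -f /usr/lib/spark    # the directory conventionally used as SPARK_HOME
```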

I would like to install the DataStax spark-cassandra-connector so I can connect to Cassandra from Spark. How can I do this?

The Stack Overflow answer Angus linked is probably the easiest route if the connector can be used as a Spark package. Based on what I can find, however, that is probably not an option, which means you will need to install sbt and build and install the connector manually.
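As a sketch of the manual route (the jar path and the Cassandra IP below are placeholders; match the build to your Spark and Scala versions, per the connector's build docs):

```shell
# Clone and build the connector; `sbt assembly` produces a fat jar that
# bundles the connector with its dependencies.
git clone https://github.com/datastax/spark-cassandra-connector.git
cd spark-cassandra-connector
sbt assembly

# Include the assembly jar when launching Spark, and point the connector
# at the standalone Cassandra VM (placeholder IP).
spark-shell \
  --jars /path/to/spark-cassandra-connector-assembly.jar \
  --conf spark.cassandra.connection.host=10.0.0.5
```

If a published artifact matching your Spark version does exist on Maven Central, passing its coordinates via `--packages` would avoid the build step entirely.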

James

You can use Cassandra together with the connector jar from DataStax: simply download the jar and pass it to the Dataproc cluster. You can find a Google-provided template, which I contributed to, at this link [1]. It explains how to use the template to connect to Cassandra from Dataproc.
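As a sketch of passing the jar along with a Dataproc job (the cluster, region, bucket, class name, and Cassandra IP are all placeholders):

```shell
# Submit a Spark job with the connector jar attached, pointing the
# connector at the standalone Cassandra VM.
gcloud dataproc jobs submit spark \
  --cluster=my-cluster \
  --region=us-central1 \
  --jars=gs://my-bucket/spark-cassandra-connector-assembly.jar \
  --class=com.example.MyCassandraJob \
  --properties=spark.cassandra.connection.host=10.0.0.5
```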

Anish Sarangi