
I am trying to connect to Astra (Cassandra) from AWS EMR, but the executors are not able to get the secure connect bundle file, which I am passing through S3.

This is the spark-submit command I am passing:

--master yarn
--class com.proj.prog
--packages com.datastax.spark:spark-cassandra-connector_2.12:3.1.0,org.apache.hadoop:hadoop-aws:3.1.2
--conf spark.files=s3://.../connect/secure-connect-proj.zip
--conf spark.cassandra.connection.config.cloud.path=secure-connect-proj.zip

The deploy mode is cluster; it works in client mode but not in cluster mode.

I also tried the following, but it didn't work either:

--conf spark.cassandra.connection.config.cloud.path=s3://.../connect/secure-connect-proj.zip

This was the error in both cases.

diagnostics: User class threw exception: java.io.IOException: \
  Failed to open native connection to Cassandra \
  at Cloud File Based Config at secure-connect-proj.zip :: \
    The provided path secure-connect-proj.zip is not a valid URL \
    nor an existing locally path. Provide an URL accessible to all executors \
    or a path existing on all executors (you may use `spark.files` \
    to distribute a file to each executor).
Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times,  \
  most recent failure: Lost task 0.3 in stage 1.0 (TID 7) \
  (ip-172-31-17-85.ap-south-1.compute.internal executor 1): \
  java.io.IOException: Failed to open native connection to Cassandra \
  at Cloud File Based Config at s3://.../connect/secure-connect-proj.zip :: \
    The provided path s3://.../connect/secure-connect-proj.zip is not a valid URL \
    nor an existing locally path. Provide an URL accessible to all executors \
    or a path existing on all executors (you may use `spark.files` \
    to distribute a file to each executor).

Please help. I know I am missing something, but I could not find a working solution.

  • Does this process have access to the S3 bucket, e.g. by setting access keys? – Alex Ott Aug 28 '21 at 20:32
  • Yes, I passed the S3 access keys inside the program, like this: `spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "") spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "") spark.sparkContext.hadoopConfiguration.set("fs.s3a.fast.upload", "true")` – Rajendra Singh Aug 28 '21 at 20:50
  • I’m not sure (need to check), but you may try passing them via the Spark conf rather than the Hadoop conf (see the sketch below). – Alex Ott Aug 28 '21 at 21:06
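
A minimal Scala sketch of that last suggestion, assuming the keys come from environment variables (the variable names are placeholders, not part of the original post): Spark properties prefixed with `spark.hadoop.` are copied into the Hadoop configuration on the driver and executors, so the S3A keys can be passed through the Spark config instead of calling `hadoopConfiguration.set` inside the program.

import org.apache.spark.sql.SparkSession

// Pass the S3A credentials as Spark properties; the "spark.hadoop." prefix
// forwards them to the Hadoop configuration on the driver and all executors.
// The environment variable names below are placeholders.
val spark = SparkSession.builder()
  .appName("astra-emr-test")
  .config("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
  .config("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
  .config("spark.hadoop.fs.s3a.fast.upload", "true")
  .getOrCreate()

The same properties can also be supplied on the command line, e.g. `--conf spark.hadoop.fs.s3a.access.key=...`.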

1 Answer


S3 URI

It's not clear from the examples you provided whether you have specified the correct S3 URI. Make sure that the URI is in one of the following forms:

s3://bucket_name/secure-connect-db_name.zip
s3://bucket_name/subdir/secure-connect-db_name.zip
s3://bucket_name/path/to/secure-connect-db_name.zip

I would suggest you update your original question and replace s3://... with s3://bucket_name to avoid confusion.

IAM roles and EMR

EMR uses EMRFS to access S3 data, so you need to configure IAM roles for EMRFS requests. EMRFS uses the permission policies attached to the service role for cluster EC2 instances.

If it isn't configured correctly, this could be the reason EMR can't access the secure bundle. For details, see Configure IAM roles for EMRFS requests to Amazon S3.

Compatibility

Make sure that you're using the correct version of the spark-cassandra-connector. Version 3.1 of the connector works with Spark 3.1, which means it will only work with Amazon EMR 6.3.

If you're using Amazon EMR 5.33, it has Spark 2.4, so you'll need to use version 2.5 of the connector.

Test with spark-shell

Test connectivity by running spark-shell so it's easier to isolate the problem.

These are the required dependencies to run the test:

libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.1.2"
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "3.1.0"

Start the spark-shell with:

spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.12:3.1.0 \
  --master {master-url} \
  --conf spark.files=s3://bucket_name/secure-connect-db_name.zip \
  --conf spark.cassandra.connection.config.cloud.path=secure-connect-db_name.zip \
  --conf spark.cassandra.auth.username=client_id \
  --conf spark.cassandra.auth.password=client_secret \
  --conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions

Finally, test the connection with:

import org.apache.spark.sql.cassandra._
val data = spark.read.cassandraFormat("table_name", "keyspace_name").load
data.printSchema
data.show

– Erick Ramirez
  • Thanks for the valuable info. Yes, I am using the latest EMR release and matched the package versions with mvnrepository. Everything works locally and in EMR in client mode, but not in cluster mode. Yes, I am passing the correct object URI provided by S3. – Rajendra Singh Aug 29 '21 at 07:08
  • Have you done the isolation test using spark-shell? What was the result? Cheers! – Erick Ramirez Aug 29 '21 at 12:08
  • @RajendraSingh did you get a chance to get this to work? The reason it works locally is that Spark can find the secure bundle on the local filesystem. When you deploy in cluster mode, you need to pass the S3 path using the `--files` option (or `SparkContext.addFile`) so Spark will distribute the secure bundle to all workers/nodes. Cheers! – Erick Ramirez Sep 06 '21 at 02:32
  • @RajendraSingh I've updated my answer. Instead of `--files`, please try `--conf spark.files`. Please let me know how it goes. Cheers! – Erick Ramirez Sep 06 '21 at 22:30
  • Yes, I tried both ways but neither worked. Thanks anyway. I solved the problem by adding a startup (bootstrap) script to the EMR cluster that copies the bundle file directly from S3 to the cluster nodes using the `aws s3 cp` command, and then passing the local path of the copied bundle in the `--conf spark.cassandra.connection.config.cloud.path` flag. – Rajendra Singh Sep 07 '21 at 08:36
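
For future readers, here is a minimal Scala sketch of that workaround, assuming an EMR bootstrap action has already copied the bundle from S3 to the same local path on every node (the local path and environment variable names below are placeholders, not from the original post):

import org.apache.spark.sql.SparkSession

// The cloud.path points at a file that the bootstrap action placed on every node,
// so it exists locally on the driver and on all executors.
// /home/hadoop/... and the environment variable names are placeholders.
val spark = SparkSession.builder()
  .appName("astra-emr")
  .config("spark.cassandra.connection.config.cloud.path", "/home/hadoop/secure-connect-proj.zip")
  .config("spark.cassandra.auth.username", sys.env("ASTRA_CLIENT_ID"))
  .config("spark.cassandra.auth.password", sys.env("ASTRA_CLIENT_SECRET"))
  .config("spark.sql.extensions", "com.datastax.spark.connector.CassandraSparkExtensions")
  .getOrCreate()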