I am using spark-sql 2.4.1 and spark-cassandra-connector_2.11-2.4.1 with Java 8 and Apache Cassandra 3.0.
I have set up my spark-submit / Spark cluster environment as below to load 2 billion records.
--executor-cores 3
--executor-memory 9g
--num-executors 5
--driver-cores 2
--driver-memory 4g
I am using a 6-node Cassandra cluster with the settings below:
cassandra.output.consistency.level=ANY
cassandra.concurrent.writes=1500
cassandra.output.batch.size.bytes=2056
cassandra.output.batch.grouping.key=partition
cassandra.output.batch.grouping.buffer.size=3000
cassandra.output.throughput_mb_per_sec=128
cassandra.connection.keep_alive_ms=30000
cassandra.read.timeout_ms=600000
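For completeness, these settings are applied through the Spark session roughly as below (a minimal sketch; the app name and connection host are placeholders, and I am using the spark.cassandra.* property names that the connector expects):

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("cassandra-bulk-load")                              // hypothetical app name
        .config("spark.cassandra.connection.host", "cassandra-host") // placeholder host
        .config("spark.cassandra.output.consistency.level", "ANY")
        .config("spark.cassandra.output.concurrent.writes", "1500")
        .config("spark.cassandra.output.batch.size.bytes", "2056")
        .config("spark.cassandra.output.batch.grouping.key", "partition")
        .config("spark.cassandra.output.batch.grouping.buffer.size", "3000")
        .config("spark.cassandra.output.throughput_mb_per_sec", "128")
        .config("spark.cassandra.connection.keep_alive_ms", "30000")
        .config("spark.cassandra.read.timeout_ms", "600000")
        .getOrCreate();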
I am loading a Spark dataframe into Cassandra tables. After reading the data into a Spark dataset, I group by certain columns as below.
Dataset<Row> dataDf = ...; // read from source, i.e. HDFS files already partitioned on "load_date", "fiscal_year", "fiscal_quarter", "id", "type", "type_code"
Dataset<Row> groupedDf = dataDf
    .groupBy("id", "type", "value", "load_date", "fiscal_year", "fiscal_quarter", "create_user_txt", "create_date")
    .pivot("type_code").agg(first(/* business logic */)); // see the "Pivoting" section below
groupedDf.write().format("org.apache.spark.sql.cassandra")
.option("table","product")
.option("keyspace", "dataload")
.mode(SaveMode.Append)
.save();
Cassandra table (abbreviated):

CREATE TABLE dataload.product (
    ...
    PRIMARY KEY ((id, type, value, item_code), load_date)
) WITH CLUSTERING ORDER BY (load_date DESC);
Basically, I group by the "id", "type", "value", "load_date" columns. Since the other columns ("fiscal_year", "fiscal_quarter", "create_user_txt", "create_date") also need to be stored in the Cassandra table, I have to include them in the groupBy clause as well.
1) Frankly speaking, I don't know how to carry those columns into the resultant dataframe (groupedDf) after the groupBy so they can be stored. Any advice on how to tackle this, please?
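For example, is something like the following the right way to carry them along? This is just a sketch of my guess (pulling the pass-through columns with first() aggregates instead of adding them to groupBy); I have not verified it:

import static org.apache.spark.sql.functions.first;

// group only on the key columns, carry the rest as first() aggregates
Dataset<Row> groupedDf = dataDf
    .groupBy("id", "type", "value", "load_date")
    .agg(first("fiscal_year").as("fiscal_year"),
         first("fiscal_quarter").as("fiscal_quarter"),
         first("create_user_txt").as("create_user_txt"),
         first("create_date").as("create_date"));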
2) With the above process/steps, my Spark load job is pretty slow due to a lot of shuffling, i.e. both read-side and write-side shuffles.
What should I do here to improve the speed?
While reading from the source (into dataDf), do I need to do anything to improve performance? The source is already partitioned.
Should I still do any repartitioning? If so, what is the best way/approach given the above Cassandra table?
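For instance, I am considering repartitioning on the table's partition-key columns right before the write, as in this sketch (again just my guess, and it assumes item_code is still present in the dataframe at that point); would this reduce the write-side shuffle?

import static org.apache.spark.sql.functions.col;

// align Spark partitions with the Cassandra partition key (id, type, value, item_code)
Dataset<Row> repartitionedDf = groupedDf.repartition(
        col("id"), col("type"), col("value"), col("item_code"));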
HDFS file columns
"id","type","value","type_code","load_date","item_code","fiscal_year","fiscal_quarter","create_date","last_update_date","create_user_txt","update_user_txt"
Pivoting
I am using groupBy because of the pivoting, as below:
Dataset<Row> pivot_model_vals_unpersist_df = model_vals_df
    .groupBy("id", "type", "value", "type_code", "load_date", "item_code", "fiscal_year", "fiscal_quarter", "create_date")
    .pivot("type_code")
    .agg(first(/* business logic */));
Please advise. Any advice/feedback would be highly appreciated.