How to improve performance my spark job here to load data into cassandra table?

Question

I am using spark-sql-2.4.1 ,spark-cassandra-connector_2.11-2.4.1 with java8 and apache cassandra 3.0 version.

I have my spark-submit or spark cluster enviroment as below to load 2 billion records.

--executor-cores 3 
--executor-memory 9g 
--num-executors 5 
--driver-cores 2 
--driver-memory 4g

I am using Cassandra 6 node cluster with below settings :

 cassandra.output.consistency.level=ANY
cassandra.concurrent.writes=1500
cassandra.output.batch.size.bytes=2056
cassandra.output.batch.grouping.key=partition 
cassandra.output.batch.grouping.buffer.size=3000
cassandra.output.throughput_mb_per_sec=128
cassandra.connection.keep_alive_ms=30000
cassandra.read.timeout_ms=600000

I am loading using spark dataframe into cassandra tables. After reading into spark data set I am grouping by on certain columns as below.

Dataset<Row> dataDf = //read data from source i.e. hdfs file which are already partitioned based "load_date", "fiscal_year" , "fiscal_quarter" , "id",  "type","type_code"

Dataset<Row> groupedDf = dataDf.groupBy("id","type","value" ,"load_date","fiscal_year","fiscal_quarter" , "create_user_txt", "create_date")



 groupedDf.write().format("org.apache.spark.sql.cassandra")
    .option("table","product")
    .option("keyspace", "dataload")
    .mode(SaveMode.Append)
    .save();

Cassandra table(
    PRIMARY KEY (( id, type, value, item_code ), load_date)
) WITH CLUSTERING ORDER BY ( load_date DESC )

Basically I am groupBy "id","type","value" ,"load_date" columns. As the other columns ( "fiscal_year","fiscal_quarter" , "create_user_txt", "create_date") should be available for storing into cassandra table I have to include them also in the groupBy clause.

1) Frankly speaking I dont know how to get those columns after groupBy into resultant dataframe i.e groupedDf to store. Any advice here to how to tackle this please ?

2) With above process/steps , my spark job for loading is pretty slow due to lot of shuffling i.e. read shuffle and write shuffle processes.

What should I do here to improve the speed ?

While reading from source (into dataDf) do I need to do anything here to improve performance? This is already partitioned.

Should I still need to do any partitioning ? If so , what is the best way/approach given the above cassandra table?

HDFS file columns

"id","type","value","type_code","load_date","item_code","fiscal_year","fiscal_quarter","create_date","last_update_date","create_user_txt","update_user_txt"

Pivoting

I am using groupBy due to pivoting as below

Dataset<Row> pivot_model_vals_unpersist_df =  model_vals_df.groupBy("id","type","value","type_code","load_date","item_code","fiscal_year","fiscal_quarter","create_date")
                .pivot("type_code" )
                .agg(  first(//business logic)
                )
              )

Please advice. Your advice/feedback are highly thankful.

I'm not really sure what you want to achieve here but I think you have some misunderstanding about `groupBy`. It is used to aggregate data, for example to sum all values that have the same id. If you want to partition the data depending on the column values you should use `repartition` or `partitionBy` (if writing directly), see e.g.: https://stackoverflow.com/questions/51569092/how-to-pass-multiple-column-in-partitionby-method-in-spark. — Shaido, Aug 28 '19 at 05:25
You can cache the dataframe before storing into Cassandra, did you try coalesce or repartition? — ashK, Aug 28 '19 at 08:30
https://stackoverflow.com/questions/29011574/how-does-spark-partitioning-work-on-files-in-hdfs this could help you.. please check with the partition, shuffling will be costly — ashK, Aug 28 '19 at 09:02
@BdLearner let me clarify. You have 2biliion rows in HDFS. You want 2biliion rows in Cassandra table with described schema. Am I right? Also, can you provide all the columns that are in your HDFS dataset? — Serge Harnyk, Aug 28 '19 at 21:07

score 1 · Answer 1 · answered Aug 29 '19 at 08:52

So, as I got from comments your task is next:

Take 2b rows from HDFS.
Save this rows into Cassandra with some conversion.
Schema of Cassandra table is not the same as schema of HDFS dataset.

At first, you definitely don't need group by. GROUP BY doesn't group columns, it group rows invoking some aggregate function like sum, avg, max, etc. Semantic is similar to SQL "group by", so it's no your case. What you really need - make your "to save" dataset fit into desired Cassandra schema.

In Java this is a little bit trickier than in Scala. At first I suggest to define a bean that would represent a Cassandra row.

public class MyClass {

   // Remember to declare no-args constructor
   public MyClass() { }

   private Long id;
   private String type;
   // another fields, getters, setters, etc
}

Your dataset is Dataset, you need to convert it into JavaRDD. So, you need a convertor.

public class MyClassFabric {
   public static MyClass fromRow(Row row) {
       MyClass myClass = new MyClass();
       myClass.setId(row.getInt("id"));
       // ....
       return myClass;
   }
}

In result we would have something like this:

JavaRDD<MyClass> rdd = dataDf.toJavaRDD().map(MyClassFabric::fromRow);
javaFunctions(rdd).writerBuilder("keyspace", "table", 
  mapToRow(MyClass.class)).saveToCassandra();

For additional info you can take a look https://github.com/datastax/spark-cassandra-connector/blob/master/doc/7_java_api.md

The issue is that your HDFS dataset is already partitioned in one way, but your Cassandra table is partitioned in another way. There is no logical way to avoid shuffling cause repartitioning is the sense of what your spark job is doing. — Serge Harnyk, Aug 29 '19 at 10:08
"Lets say if I partition hdfs by "colA" , "colB" , "colC" , and "colD" and cassandra table partitioned by "colA" , "colB" , "colC" , does it result in shuffle ?" Formally this is a different partition schemas, so it would lead to shuffle. Let assume, that colD has just two options "op1" and "op2". So, for data where colA =1, colB=1 and colC = 1 it would be two partitions, one for op1 and one for op2. And this partitions may be on different nodes. So, when you switch to partitioning {"colA" , "colB" , "colC"}, Spark should move "op2" data to "op1" data. — Serge Harnyk, Aug 29 '19 at 11:50

How to improve performance my spark job here to load data into cassandra table?

1 Answers1