I am using Spark to read from a database and write to HDFS as Parquet files. Here is the code snippet.
import java.util.Properties;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

private long etlFunction(SparkSession spark) {
    // Write Parquet output with Snappy compression
    spark.sqlContext().setConf("spark.sql.parquet.compression.codec", "snappy");
    Properties properties = new Properties();
    properties.put("driver", "oracle.jdbc.driver.OracleDriver");
    properties.put("fetchSize", "5000");
    // jdbcUrl and query are instance fields set elsewhere
    Dataset<Row> dataset = spark.read().jdbc(jdbcUrl, query, properties);
    dataset.write().format("parquet").save("pdfs-path");
    return dataset.count();
}
When I look at the Spark UI during the write, I can already see the number of records written; it shows up in the SQL tab under the query plan.
The count() itself, however, is a heavy task: it re-executes the whole JDBC read just to produce the number. Can someone suggest the most optimized way to get this count?
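For context, the one workaround I am aware of is persisting the dataset so that count() is served from the cache instead of hitting the database a second time. A minimal sketch, assuming the dataset fits in executor memory/disk and the extra caching cost is acceptable:

import org.apache.spark.storage.StorageLevel;

// Cache before the write so count() does not re-read from Oracle
dataset.persist(StorageLevel.MEMORY_AND_DISK());
dataset.write().format("parquet").save("pdfs-path");
long count = dataset.count();  // served from the cache
dataset.unpersist();
return count;

This avoids the second database scan, but caching a large table is itself expensive, so I am hoping there is something better.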
To add: there is a solution mentioned in the suggested duplicate that involves counting with a SparkListener. I am heavily reusing the SparkSession, so that would be much trickier to implement.
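For reference, my understanding of that listener-based approach looks roughly like the sketch below (this is my reading of the duplicate, not tested code). Since the listener is registered on the shared SparkContext, it observes task metrics from every job running on the reused session, which is exactly why isolating the records written by this one write is tricky:

import java.util.concurrent.atomic.LongAdder;
import org.apache.spark.scheduler.SparkListener;
import org.apache.spark.scheduler.SparkListenerTaskEnd;

final LongAdder recordsWritten = new LongAdder();
spark.sparkContext().addSparkListener(new SparkListener() {
    @Override
    public void onTaskEnd(SparkListenerTaskEnd taskEnd) {
        // Sums output records across ALL jobs on this shared session,
        // not just my write, hence the difficulty when reusing the session.
        if (taskEnd.taskMetrics() != null) {
            recordsWritten.add(taskEnd.taskMetrics().outputMetrics().recordsWritten());
        }
    }
});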
Thanks, all.