
I am using Spark to read from a database and write to HDFS as a parquet file. Here is the code snippet:

private long etlFunction(SparkSession spark) {
    spark.sqlContext().setConf("spark.sql.parquet.compression.codec", "SNAPPY");
    Properties properties = new Properties();
    properties.put("driver", "oracle.jdbc.driver.OracleDriver");
    properties.put("fetchSize", "5000");
    Dataset<Row> dataset = spark.read().jdbc(jdbcUrl, query, properties);
    dataset.write().format("parquet").save("pdfs-path");
    return dataset.count();
}

When I look at the Spark UI during the write, the number of records written is already visible in the SQL tab under the query plan.

The count() call itself, however, runs as a separate, heavy job.

Can someone suggest the most optimized way to get this count?

To add: there is a solution mentioned in the duplicate that involves counting with a SparkListener (a rough sketch is below). I am heavily reusing the SparkSession, so that would be much trickier to implement.
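For reference, the listener-based approach from that question would look roughly like this, a sketch assuming the Spark 2.x Java API and the dataset/path from my snippet. With a shared SparkSession the listener sees tasks from every job running on that session, which is what makes it awkward here:

import java.util.concurrent.atomic.AtomicLong;
import org.apache.spark.scheduler.SparkListener;
import org.apache.spark.scheduler.SparkListenerTaskEnd;

// Accumulates records written by all tasks of all jobs on this SparkSession
final AtomicLong recordsWritten = new AtomicLong(0L);

spark.sparkContext().addSparkListener(new SparkListener() {
    @Override
    public void onTaskEnd(SparkListenerTaskEnd taskEnd) {
        recordsWritten.addAndGet(taskEnd.taskMetrics().outputMetrics().recordsWritten());
    }
});

dataset.write().format("parquet").save("pdfs-path");
long written = recordsWritten.get(); // also includes writes from any other job on the shared session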

Thanks all..

rohit
  • Possible duplicate of [Spark: how to get the number of written rows?](http://stackoverflow.com/questions/37496650/spark-how-to-get-the-number-of-written-rows) –  Nov 05 '16 at 14:54

1 Answer


Parquet is really fast at counting, so you can try return spark.read().parquet("pdfs-path").count();.
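A minimal sketch of how that could fit into the question's etlFunction (assuming the same jdbcUrl, query and "pdfs-path" fields and the Spark 2.x Java API):

private long etlFunction(SparkSession spark) {
    spark.sqlContext().setConf("spark.sql.parquet.compression.codec", "SNAPPY");
    Properties properties = new Properties();
    properties.put("driver", "oracle.jdbc.driver.OracleDriver");
    properties.put("fetchSize", "5000");
    Dataset<Row> dataset = spark.read().jdbc(jdbcUrl, query, properties);
    // Write once over JDBC ...
    dataset.write().format("parquet").save("pdfs-path");
    // ... then count from the parquet files already on HDFS
    // instead of triggering the JDBC query a second time
    return spark.read().parquet("pdfs-path").count();
}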

Mariusz
  • This will return a dataset; I am expecting the count as a long. – rohit Nov 05 '16 at 17:16
  • And how would it be different from the count I am already doing in my question? I am counting the same dataset that I wrote to parquet. – rohit Nov 05 '16 at 17:42
  • Your count is performed over JDBC. A count on parquet uses the data already written to HDFS and is pretty fast (the row count is probably stored in the parquet file metadata). – Mariusz Nov 05 '16 at 17:45
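To illustrate that last point, here is a driver-side sketch that sums the row counts stored in the parquet footers, without running any Spark job. It assumes the parquet-hadoop API available alongside Spark 2.x; countFromParquetMetadata is a hypothetical helper, not part of any library:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

// Sum the row counts recorded in the footer of every part file under the output path
private long countFromParquetMetadata(Configuration conf, String outputPath) throws IOException {
    long rows = 0L;
    FileSystem fs = FileSystem.get(conf);
    for (FileStatus status : fs.listStatus(new Path(outputPath))) {
        if (!status.getPath().getName().startsWith("part-")) {
            continue; // skip _SUCCESS and other non-data files
        }
        ParquetMetadata footer = ParquetFileReader.readFooter(conf, status.getPath());
        for (BlockMetaData block : footer.getBlocks()) {
            rows += block.getRowCount();
        }
    }
    return rows;
}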