I just want to understand why Spark's repartition increases the data volume. When I did the same operation with coalesce, the output size was correct. But when I repartitioned about 100 GB of data, it grew to around 400 GB (or even more).

Here is my code that does the repartition:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.SparkSession;

    SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("spark computation");
    JavaSparkContext sc = new JavaSparkContext(conf);
    SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
    String partition = "file='hit_data'";
    spark.read()
        .format("delta")
        .load("delta-table/clickstream/")
        // .where(partition)
        .repartition(10)                   // full shuffle into 10 partitions
        .write()
        .format("delta")
        .mode("overwrite")
        // .option("replaceWhere", partition)
        .save("delta-table/clickstream/");

    spark.stop();
    sc.close();
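
For comparison, the coalesce version I tested looked roughly like this (a minimal sketch; I am assuming the same Delta path and the same target of 10 partitions as above). Unlike repartition, coalesce only merges existing partitions through a narrow dependency and does not do a full shuffle:

    // Coalesce merges existing partitions without a full shuffle,
    // so rows are not redistributed across the cluster.
    spark.read()
        .format("delta")
        .load("delta-table/clickstream/")
        .coalesce(10)                      // assumed target partition count
        .write()
        .format("delta")
        .mode("overwrite")
        .save("delta-table/clickstream/");

This version produced the output size I expected.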
