I just want to understand why Spark's repartition increases the data volume. When I did the same operation with coalesce, the output size was correct. But when I repartitioned about 100 GB of data, it grew to around 400 GB (or even more).

Here is my code that does the repartition:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.SparkSession;

    SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("spark computation");
    JavaSparkContext sc = new JavaSparkContext(conf);
    SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
    String partition = "file='hit_data'";
    spark.read()
        .format("delta")
        .load("delta-table/clickstream/")
        // .where(partition)
        .repartition(10)                   // full shuffle into 10 partitions
        .write()
        .format("delta")
        .mode("overwrite")
        // .option("replaceWhere", partition)
        .save("delta-table/clickstream/");

    spark.stop();
    sc.close();
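
For comparison, the coalesce version I tested looked roughly like this (a minimal sketch; I am assuming the same Delta path and the same target of 10 partitions as above). Unlike repartition, coalesce only merges existing partitions through a narrow dependency and does not do a full shuffle:

    // Coalesce merges existing partitions without a full shuffle,
    // so rows are not redistributed across the cluster.
    spark.read()
        .format("delta")
        .load("delta-table/clickstream/")
        .coalesce(10)                      // assumed target partition count
        .write()
        .format("delta")
        .mode("overwrite")
        .save("delta-table/clickstream/");

This version produced the output size I expected.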
