
I am new to Spark and have a question.

Are more partitions always better in Spark? If I have an OOM issue, do more partitions help?

qingpan
  • Possible duplicate of [Number of partitions in RDD and performance in Spark](http://stackoverflow.com/questions/35800795/number-of-partitions-in-rdd-and-performance-in-spark) – WestCoastProjects Jun 16 '16 at 20:23

1 Answer


Partitions determine the degree of parallelism. The Apache Spark documentation says that the number of partitions should be at least equal to the number of cores in the cluster.

With too few partitions, not all the cores in the cluster are utilized. With too many partitions and little data, too many small tasks get scheduled.
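As a minimal sketch (the local master, app name, and data are placeholders, not part of the question), this is how you can inspect and change the partition count; `repartition` reshuffles to a higher count, while `coalesce` shrinks it without a full shuffle:

    import org.apache.spark.sql.SparkSession

    // Placeholder local session; point master/appName at your own cluster.
    val spark = SparkSession.builder()
      .appName("partition-demo")
      .master("local[4]")                        // 4 cores available locally
      .config("spark.default.parallelism", "8")  // default partition count for parallelize and RDD shuffles
      .getOrCreate()

    val rdd = spark.sparkContext.parallelize(1 to 1000000)
    println(rdd.getNumPartitions)   // inspect the current partition count

    val wider    = rdd.repartition(16)  // full shuffle to more partitions
    val narrower = rdd.coalesce(2)      // merge down without a full shuffle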

If you are getting an out-of-memory issue, you would have to increase the executor memory. It should be a minimum of 8 GB.
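A sketch of where that setting lives, using the standard `spark.executor.memory` property; the app name, jar, and main class in the comment are placeholders, and 8 GB is just the suggestion above, not a universal rule:

    import org.apache.spark.sql.SparkSession

    // Placeholder app name; tune the memory value to your workload and cluster.
    val spark = SparkSession.builder()
      .appName("memory-demo")
      .config("spark.executor.memory", "8g")  // heap size per executor
      .getOrCreate()

    // Equivalent when submitting a packaged job:
    //   spark-submit --executor-memory 8g --class my.Main my-app.jar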

Dazzler
  • I would add that partitions are optimized to be about [128 MB each, which is the default](http://www.bigsynapse.com/spark-input-output). – Katya Willard Jun 16 '16 at 19:48
  • **@Dazzler**, **@Katya Handler** if I'm reading data from `MySQL` (`JDBC`), then the resulting `DataFrame` has as many `partition`s as the *degree of parallelism* of the *read task* (determined by the [`numPartitions` parameter](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader@jdbc(url:String,table:String,columnName:String,lowerBound:Long,upperBound:Long,numPartitions:Int,connectionProperties:java.util.Properties):org.apache.spark.sql.DataFrame)). In that case, how do I control the *size of `partition`s*? (a sketch of this call follows after these comments) – y2k-shubham Mar 23 '18 at 11:28
  • Please see [this comment](https://stackoverflow.com/questions/36009392/spark-is-there-any-rule-of-thumb-about-the-optimal-number-of-partition-of-a-rdd/36020027#comment85897834_36020027) for more details – y2k-shubham Mar 23 '18 at 11:28
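Regarding the JDBC question in the comments above, here is a sketch of the `DataFrameReader.jdbc` overload that takes `numPartitions`; the URL, table, column, and credentials are hypothetical. Partition size is controlled only indirectly: rows whose partitioning column falls between `lowerBound` and `upperBound` are split into `numPartitions` roughly equal ranges, so you choose `numPartitions` relative to the table's row count and row width.

    import java.util.Properties
    import org.apache.spark.sql.SparkSession

    // Requires the MySQL JDBC driver on the classpath.
    val spark = SparkSession.builder().appName("jdbc-partitions").getOrCreate()

    val props = new Properties()
    props.setProperty("user", "dbuser")      // hypothetical credentials
    props.setProperty("password", "dbpass")

    val df = spark.read.jdbc(
      url = "jdbc:mysql://dbhost:3306/mydb",  // hypothetical URL
      table = "orders",                       // hypothetical table
      columnName = "id",                      // numeric partitioning column
      lowerBound = 1L,
      upperBound = 1000000L,
      numPartitions = 16,
      connectionProperties = props
    )

    println(df.rdd.getNumPartitions)  // 16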