
Spark creates logical partitions within an RDD. I have two questions about this:

1) Everywhere on Google it is said that partitions help with parallel processing, where each partition can be processed on a separate node. My question is: if I have a multi-core machine, can I not process several partitions on the same node?

2) Say I read a file from the file system and Spark creates one RDD with four partitions. Can each partition then be divided further into more partitions? For example:

    val firstRDD = sc.textFile("hdfs://...")
    // firstRDD contains four partitions, which are processed on four different nodes
    val secondRDD = firstRDD.filter(someFunction)
    // Will each node now create a separate secondRDD that is partitioned even further?
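If it helps, this is how I would check the partition counts in spark-shell (getNumPartitions is the standard RDD method; someFunction is just the placeholder above):

    firstRDD.getNumPartitions   // 4
    secondRDD.getNumPartitions  // is this also 4, or are the partitions split further?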
scott miles

1 Answer


An input text file split into 4 partitions, which may sit on a single node or on up to 4 nodes, will not be split into more partitions; each partition will be evaluated by the same executor that read it in initially. However, you may repartition the RDD/DataFrame to increase parallelism (such as having 64 partitions for your 64 executors). This forces a shuffle, which can be costly but is worthwhile, particularly for computationally expensive work.

A common situation where this is a problem is reading in unsplittable files such as GZIP files. A single executor has to read in the whole file (and do the processing!) regardless of its size, so repartitioning is hugely beneficial for many GZIP workloads because it enables parallelized computation.
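Here is a minimal sketch of the repartitioning described above (the path, the count of 64, and parseLine are illustrative placeholders, not from the question):

    // A .gz file is unsplittable, so it arrives as a single partition
    val raw = sc.textFile("hdfs://.../big-log.gz")
    raw.getNumPartitions                  // 1

    // repartition(64) forces a shuffle that spreads the rows across 64 partitions
    val spread = raw.repartition(64)

    // the expensive per-row work now runs on up to 64 executors in parallel
    val parsed = spread.map(parseLine)

The shuffle is worth paying when the downstream work (the map here) costs far more than moving the data across the network once.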

Garren S
  • You said `..will thus be evaluated by the same executor that read them in initially`. If 4 partitions are processed on 4 nodes, there will be 4 executors, not 1, right? Also, when you say `However, you may repartition..`, do you mean that once a partition is created in an RDD, it will not be partitioned further by default until we do so explicitly? – scott miles Jun 04 '17 at 07:35
  • Yes, if 4 nodes read in 1 partition each, that implies 4 executors. The data in one partition can be split into more partitions depending on the transformations (namely aggregations). For example, reading in one big GZIP file, adding a few columns, and sanitizing the data without aggregations, then writing it out, would give a single executor the entire workload unless you explicitly tell it to repartition. – Garren S Jun 04 '17 at 15:22
  • https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-partitions.html – scott miles Jun 05 '17 at 03:31
  • @scottmiles Jacek is a great source and that book is chock-full of goodies – Garren S Jun 05 '17 at 04:58
  • Can you please have a look at https://stackoverflow.com/questions/44361540/analysing-log-file-with-spark ? – scott miles Jun 06 '17 at 02:19