Apache Spark shell context: how do you set the number of partitions when using the shell? It is not clear in the doc I am reviewing. Is the default just 2 partitions?
- The number of partitions for what? JOINing, saving output? – thebluephantom Sep 04 '18 at 23:31
- Processing and transforming a large dataset in parallel. The default in standalone mode is the number of cores. – Sep 08 '18 at 22:33
- The answer below concurs with my comment. I think you may need to redefine the question, as it could be considered too broad. – thebluephantom Sep 09 '18 at 02:39
1 Answer
But the number of partitions for what? There are several different parameters in Spark (e.g. spark.sql.shuffle.partitions for shuffling, spark.default.parallelism when you do transformations with RDDs). You can also change the number of partitions for a Dataset/DataFrame with coalesce/repartition, etc.
The default number of partitions for datasets also differs depending on whether you work on your local PC or on a Hadoop cluster.
You need to specify exactly which partition setting you want to change.
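For the spark-shell specifically, those parameters can be passed as --conf flags at launch. A minimal sketch (the master URL and the values 4 and 8 are placeholders for illustration, not recommendations):

```shell
# Start spark-shell with explicit partition-related settings
# (tune these values for your own workload and cluster).
spark-shell \
  --master local[4] \
  --conf spark.default.parallelism=8 \
  --conf spark.sql.shuffle.partitions=8
```

Inside the shell you can then inspect and adjust per dataset, e.g. with `sc.defaultParallelism`, `rdd.getNumPartitions`, `df.repartition(8)`, or `df.coalesce(4)`.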
Here are some good links, that could clarify your question more:
How does Spark partition(ing) work on files in HDFS?
Spark Partitions: Loading a file from the local file system on a Single Node Cluster

Tomasz Krol
- I have seen that the default is the number of cores of the machine when working in standalone mode. I mean partitions for a map-reduce operation. – Sep 04 '18 at 23:04