
I’m trying to read data from Hive with a Spark DataFrame and distribute it into a specific, configurable number of partitions (in proportion to the number of cores). My job is pretty straightforward and does not contain any joins or aggregations. I’ve read about the spark.sql.shuffle.partitions property, but the documentation says:

Configures the number of partitions to use when shuffling data for joins or aggregations.

Does this mean that it would be irrelevant for me to configure this property? Or is the read operation considered a shuffle? If not, what is the alternative? Repartition and coalesce seem a bit like overkill for that matter.
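
For reference, the job is essentially just a read from Hive followed by a write, roughly like the sketch below (the database/table name, output format and path are placeholders, not the real ones):

```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-read-write")
    .enableHiveSupport()
    .getOrCreate()
)

# Plain read from a Hive table followed by a write; no joins or aggregations.
df = spark.table("my_db.my_table")
df.write.mode("overwrite").parquet("/some/output/path")
```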

user7551211

2 Answers


To verify my understanding of your problem: you want to increase the number of partitions in the RDD/DataFrame that is created immediately after reading the data.

In this case the property you are after is spark.sql.files.maxPartitionBytes, which controls the maximum amount of data that can be packed into a single partition (please refer to https://spark.apache.org/docs/2.4.0/sql-performance-tuning.html). The default value is 128 MB, which can be overridden to improve parallelism.
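
A minimal sketch of setting it when building the session (the 64 MB value and the table name are illustrative placeholders, not recommendations):

```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tune-input-partitions")
    .config("spark.sql.files.maxPartitionBytes", 64 * 1024 * 1024)  # 64 MB, in bytes
    .enableHiveSupport()
    .getOrCreate()
)

df = spark.table("my_db.my_table")
# For file-based scans this is roughly total input size / 64 MB.
print(df.rdd.getNumPartitions())
```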

Vikas Saxena
  • Two notes that I want to clarify: 1. I read the data from Hive; is the files property still relevant? 2. I don’t want to limit the number of partitions, I want to choose the exact number of partitions, so I’ll be able to fully manage the parallelism. – user7551211 Sep 21 '21 at 19:26
  • This property would work if you are using a ```spark.read.xxx``` function for reading data, irrespective of the function xxx. In other words, it will work if you read with any of the functions like ```spark.read.table```, ```spark.read.json``` etc. The partitions in a DataFrame or RDD do not depend on the partitions of the source and may not align with the source you are reading data from; for example, if you are doing a full table scan on a table with 10 partitions, your DataFrame may have more than 10 partitions. – Vikas Saxena Sep 22 '21 at 05:10
  • There is no way of deciding the number of partitions for a dataframe at run time that I am aware of; older versions of Spark had that option, but it would be a static value rather than a dynamic one. Please refer to this link for the same: https://stackoverflow.com/questions/39368516/number-of-partitions-of-spark-dataframe – Vikas Saxena Sep 22 '21 at 05:13
  • I did mean a static value, like the number I’d put on the spark.sql.shuffle.partitions property, but for reading. – user7551211 Sep 22 '21 at 06:07

Read is not a shuffle as such. You need to get the data in at some stage.

The approach in the other answer can be used; otherwise an algorithm in Spark sets the number of partitions upon a read.

You do not state if you are using the RDD or DataFrame API. With an RDD you can set the number of partitions at read time. With a DataFrame you generally need to repartition after the read.
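
A minimal sketch of both, where the path, table name and the value 50 are placeholders for your own:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# RDD API: a minimum number of partitions can be requested at read time
# (the path is a placeholder).
rdd = spark.sparkContext.textFile("/some/hdfs/path", minPartitions=50)

# DataFrame API: read first, then repartition to the exact number you want
# (the table name and 50 are placeholders).
df = spark.table("my_db.my_table").repartition(50)
print(df.rdd.getNumPartitions())  # 50
```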

As you note, your point on controlling parallelism via that property is less relevant since you are not joining or aggregating.

thebluephantom
  • So you’re saying that if my job only reads and writes, then spark.sql.shuffle.partitions won’t have any effect at all? I would read my data into 200 partitions even if I configure this property to 50, for example? – user7551211 Sep 22 '21 at 15:37
  • Afaik yes. Shuffle... is for joins, aggregations. – thebluephantom Sep 22 '21 at 16:53