I have a question about the number of partitions in a Spark dataframe.

Suppose I have a Hive table (employee) with the columns (name, age, id, location).

CREATE TABLE employee (name String, age String, id Int) PARTITIONED BY (location String);

If the employee table has 10 different locations, the data will be split into 10 partitions (directories) in HDFS.

Now I create a Spark dataframe (df) by reading the entire Hive table (employee).

How many partitions will Spark create for the dataframe (df)?

df.rdd.partitions.size = ??
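
To make the question concrete, this is roughly how I would check it from spark-shell (a minimal Scala sketch; it assumes the SparkSession named spark was built with Hive support and can see the employee table):

val df = spark.table("employee")   // read the whole employee table
println(df.rdd.partitions.size)    // the number I am asking about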

Sri

1 Answer


The number of partitions depends on the HDFS block size.

Imagine you read all 10 Hive partitions as a single RDD. If the HDFS block size is 128 MB, then

number of partitions = (total size of the 10 Hive partitions in MB) / 128 MB

i.e. roughly one Spark partition per HDFS block that the data occupies.
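
As a rough sketch of what that means in practice (Scala for spark-shell; the 1280 MB figure below is just an assumed total size for the 10 Hive partitions, not something given in the question):

val df = spark.table("employee")   // reads all 10 Hive partitions
df.rdd.getNumPartitions            // driven by HDFS blocks, not by Hive partitions
// e.g. if the 10 Hive partitions together hold ~1280 MB and the block size
// is 128 MB, expect roughly 1280 / 128 = 10 input splits / partitions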

Please refer to the following link:

http://www.bigsynapse.com/spark-input-output

sk79
  • Yes, I am aware that if a file in HDFS spans 10 blocks (for example, a 640 MB file with a 64 MB block size), then a Spark RDD created by reading that file from HDFS will have 10 partitions. – Sri May 10 '17 at 13:22
  • But I am talking about a Hive table that is partitioned by one column. Will that be the driving factor when Spark decides the number of partitions of a dataframe created by reading that Hive table? – Sri May 10 '17 at 13:25
  • I don't think this answer is always correct, because when I created a dataframe from a Hive external table, the number of partitions was 119. The table was a partitioned Hive table with 150 part files in it, with a minimum file size of 30 MB and a maximum of 118 MB. – dileepVikram Apr 02 '20 at 12:37
  • [link](https://stackoverflow.com/a/51877075/4652536) This answer may be more complete. – yuxh Mar 22 '21 at 09:54