
I am running the code below in Spark to create a table temp1 with 200 partitions. But when I check the actual number of partitions by creating an RDD out of the temp1 table, it comes out to more than 200. How is this possible? Am I missing anything? It would be really helpful if anyone could tell me what I am missing. Thanks!

    import hiveContext.implicits._   // needed for the $"NDC" column syntax

    val TransDataFrame = hiveContext.sql(
      s"""SELECT *
          FROM uacc.TRANS
          WHERE PROD_SURRO_ID != 0
            AND MONTH_ID >= 201401
            AND MONTH_ID <= 201403
            AND CRE_DT <= '2016-11-13'
       """).repartition(200, $"NDC")

    // registerTempTable returns Unit, so keep the DataFrame in its own val
    TransDataFrame.registerTempTable("temp")


   hiveContext.sql(
      s"""
          CREATE TABLE uacc.temp1
          AS SELECT * FROM temp
        """) 


    val df = hiveContext.sql("SELECT * FROM uacc.temp1")
    df.rdd.getNumPartitions
    // res0: Int = 1224

1 Answer


When you create the table uacc.temp1, you actually write your DataFrame to HDFS. When you load that table again, the number of partitions is controlled by the number of HDFS files (more specifically, file splits); see How does partitioning work for data from files on HDFS?
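If the 200-way partitioning is needed again after reloading, it has to be re-applied on the read side. A minimal sketch, using the same HiveContext and column names as the question:

    import hiveContext.implicits._   // for the $"..." column syntax

    // Reading the table back plans one Spark partition per HDFS file split,
    // so the 200 shuffle partitions used before the write are not preserved.
    val reloaded = hiveContext.sql("SELECT * FROM uacc.temp1")
    reloaded.rdd.getNumPartitions    // follows the file splits, e.g. 1224

    // Re-apply the desired partitioning after the load:
    val byNdc = reloaded.repartition(200, $"NDC")
    byNdc.rdd.getNumPartitions       // 200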

– Raphael Roth
If you check TransDataFrame, it will have 200 partitions. You are writing it to HDFS and then loading it back into a new RDD, which gets its number of partitions from the number of blocks HDFS used to save the table. – Noman Khan Jan 13 '17 at 14:03
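To see the comment's point concretely, you can count the files the CTAS actually wrote. A sketch using the Hadoop FileSystem API; the warehouse path below is an assumption, adjust it to wherever Hive stores the table on your cluster:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Hypothetical location -- the default Hive warehouse layout is assumed.
    val tablePath = new Path("/user/hive/warehouse/uacc.db/temp1")

    val fs = FileSystem.get(sc.hadoopConfiguration)
    val files = fs.listStatus(tablePath).filter(_.isFile)

    // Each file is further divided into splits on HDFS block boundaries when
    // read back, so the reloaded RDD's partition count tracks these splits,
    // not the 200 shuffle partitions used before the write.
    println(s"files on disk: ${files.length}")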