
I am trying to partition the data in PySpark by a specific ID (customer_id).

from pyspark.sql import HiveContext

mysqlContext = HiveContext(sc)

# the COALESCE hint asks Spark SQL for 100 output partitions
query_1 = f"SELECT /*+ COALESCE(100) */ {fields_Str} FROM database_customer.customer_table WHERE customer_id IN ({customer_id_Str})"

df_1 = mysqlContext.sql(query_1)

where fields_Str contains the fields selected and customer_id_Str contains 500 unique customer IDs.
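
For reference, the two strings are assembled roughly like this (the field names and IDs below are placeholders, not the real ones):

fields = ["customer_id", "purchase_date", "amount"]   # placeholder column names
customer_ids = range(1, 501)                          # stands in for the 500 real IDs

fields_Str = ", ".join(fields)
customer_id_Str = ", ".join(str(c) for c in customer_ids)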

When I check the number of partitions with this line of code:

df_1.rdd.getNumPartitions()

I get 100, which is what I expect. But when I run the following code:

df_1 = df_1.repartition(500, "customer_id")
df_1.show(30, truncate=False)

It takes about 30 minutes of processing, with the Jobs page of the Spark UI showing the following: [screenshot of the Jobs page]

The page shows 6020 jobs running, and the whole thing takes 30 minutes. Even after the repartition is done, re-running df_1.show(30, truncate=False) takes the same time again.
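
Since show is an action, I assume every call re-executes the whole plan unless the result is cached. A minimal sketch of the persist variant I am considering (assuming the cluster has memory to spare; I have not verified that it helps):

from pyspark import StorageLevel

df_1 = df_1.repartition(500, "customer_id")
df_1.persist(StorageLevel.MEMORY_AND_DISK)  # keep the shuffled result around
df_1.show(30, truncate=False)  # first action still pays the full shuffle cost
df_1.show(30, truncate=False)  # later actions should reuse the cached partitions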

Any idea why?

  • Please consider using coalesce when decreasing the number of partitions, as repartition does a full data shuffle (see the sketch after this thread): https://stackoverflow.com/questions/31610971/spark-repartition-vs-coalesce – jrip Jun 06 '21 at 09:03
  • @user4126670 but the coalesce part is already done in the SQL query, no? – infinity911 Jun 06 '21 at 14:12
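
For readers following the comment thread, a minimal sketch of the difference jrip points at, assuming df_1 starts out with 100 partitions:

# coalesce(n) only merges existing partitions into fewer ones: no full shuffle
df_fewer = df_1.coalesce(50)

# repartition(n, col) hash-partitions every row by the column value,
# which forces a full shuffle across the cluster
df_by_id = df_1.repartition(500, "customer_id")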
