This is my spark-submit configuration:
--master yarn \
--deploy-mode cluster \
--driver-cores 2 \
--driver-memory 10G \
--num-executors 100 \
--executor-cores 4 \
--executor-memory 8G \
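For reference, these flags would sit in a full spark-submit invocation like the sketch below; the main class and jar path are hypothetical placeholders, not my real ones. With 100 executors at 4 cores each, this gives up to 400 concurrent task slots, which is why I expected 400 parallel tasks.

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-cores 2 \
  --driver-memory 10G \
  --num-executors 100 \
  --executor-cores 4 \
  --executor-memory 8G \
  --class com.example.MyJob \
  /path/to/my-job.jar
# com.example.MyJob and /path/to/my-job.jar are hypothetical placeholders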
Here is some of my Scala code:
import spark.implicits._ // needed for .toDF() on the RDD below

val preDF = sc
  .textFile(hdfspath) // this path contains about 400 gz files, each about 100 MB
  .repartition(400)
  .map(line => tmpFormat(line)) // case class tmpFormat(data0: String)
  .toDF()
preDF.createOrReplaceTempView("pre_tmp_view")
// concatenate every 2000 rows of data0 into one newline-separated value
val tempDF = spark.sql(
  s"""
     |select
     |  concat_ws("\n", collect_set(data0)) as data1
     |from (
     |  select
     |    data0, row_number() over (order by data0) as rn
     |  from pre_tmp_view
     |) a
     |group by ceil(rn / 2000)
  """.stripMargin)
tempDF.createOrReplaceTempView("temp_view")
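For reference, the parallelism of each step can be checked like this (a minimal sketch, assuming the code above has already run in the same session):

// how many partitions (and hence tasks) each side has
println(preDF.rdd.getNumPartitions)   // 400, from repartition(400)
println(tempDF.rdd.getNumPartitions)  // partition count of the SQL result
tempDF.explain()                      // prints the physical plan of the query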
pre_tmp_view looks like this:
+-------+
| data0 |
+-------+
|  {}   |
|  {}   |
|  {}   |
|  {}   |
|  {}   |
|  ...  |
|  {}   |
+-------+
and temp_view looks like this:
+---------------------------+
|           data1           |
+---------------------------+
| {}\n{}\n{}\n{}\n ... {}   |
| ...                       |
| {}\n{}\n{}\n{}\n ... {}   |
+---------------------------+
When my Scala code runs this Spark SQL query, the job has only one running task. Shouldn't it run 400 tasks, like the RDD stage behind preDF? I tried repartitioning the result of this SQL, but it didn't work. Is this caused by data skew or by the shuffle, and how should I change the SQL code? I'm a rookie in Spark and Spark SQL, so any help would be appreciated. Thanks!
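What I mean by "repartitioning the result" is roughly the following (a sketch; my exact call may have differed slightly):

// attempting to force 400 partitions on the SQL result
val repartitionedDF = tempDF.repartition(400)
repartitionedDF.createOrReplaceTempView("temp_view")
// the job still showed only one running task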