
This is my spark-submit conf:

--master yarn \
--deploy-mode cluster \
--driver-cores 2 \
--driver-memory 10G \
--num-executors 100 \
--executor-cores 4 \
--executor-memory 8G
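
For completeness, the full command looks roughly like this (the main class and jar name are placeholders, not my real ones):

```shell
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-cores 2 \
  --driver-memory 10G \
  --num-executors 100 \
  --executor-cores 4 \
  --executor-memory 8G \
  --class com.example.MyJob \   # placeholder main class
  my-job.jar                    # placeholder jar
```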

This is some of my Scala code:

import spark.implicits._   // needed for .toDF() on an RDD of case classes

val preDF = sc
  .textFile(hdfspath)             // this path contains about 400 gz files, each about 100MB
  .repartition(400)
  .map(line => tmpFormat(line))   // case class tmpFormat(data0: String)
  .toDF()

preDF.createOrReplaceTempView("pre_tmp_view")

val tempDF = spark.sql(
  s"""
     |select
     |    concat_ws("\n", collect_set(data0)) as data1
     |from
     |    (select
     |        data0,
     |        row_number() over (order by data0) as rn
     |     from pre_tmp_view
     |    ) a
     |group by ceil(rn / 2000)
  """.stripMargin)

tempDF.createOrReplaceTempView("temp_view")
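
To make the intent concrete, here is the same grouping in plain Scala (no Spark, just the logic the SQL is supposed to express; the row values are made-up placeholders, not my real data):

```scala
// Number the rows, bucket them 2000 at a time, and join each bucket with "\n".
val rows: Seq[String] = (1 to 5000).map(i => s"""{"id":$i}""")

// grouped(2000) plays the role of group by ceil(rn / 2000) in the SQL:
// 5000 rows -> 3 buckets of 2000, 2000, and 1000 lines each.
val buckets: List[String] = rows.grouped(2000).map(_.mkString("\n")).toList
```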

pre_tmp_view looks like this:
+-------+
| data0 |
+-------+
|  {}   |
+-------+
|  {}   |
+-------+
|  {}   |
+-------+
|  {}   |
+-------+
|  {}   |
+-------+
  ······
+-------+
|  {}   |
+-------+


and temp_view looks like this:
+---------------------------+
|           data1           |
+---------------------------+
|  {}\n{}\n{}\n{}\n ... {}  |
+---------------------------+
            ······
+---------------------------+
|  {}\n{}\n{}\n{}\n ... {}  |
+---------------------------+

When my Scala code runs this Spark SQL, the job has only one running task. Shouldn't it run 400 tasks, like the RDD behind preDF? I tried repartitioning the result of this SQL, but it didn't help. Is this caused by skew or by a shuffle? How should I change the SQL? I'm a rookie in Spark and Spark SQL. Thanks!

thebluephantom