sql order by does not work when spark connect to kylin via jdbc

Question

We use apache spark to query data from apache kylin via jdbc,we write a sql with a order by clause,but sometimes it does not work.

Briefly,there is a table called shop_dim in kylin,a field SHOP_ID is not in descending order,I use spark jdbc to query sql select * from shop_dim order by shop_id desc to get a dataframe,If I select all the columns,order by not work,but if I select only a few fields,it works.

By the way,I have a guess:if the kylin table is dimension,this problem occurs,if the table is measure,it is ok.

It is the details below:

When I run query select * from shop_dim on the kylin page of web browser (http://someip:7070/kylin/query#query_content_results) The result is:

+---------------+---------------+-------+
|      SHOP_NAME|SHOP_COMPANY_ID|SHOP_ID|
+---------------+---------------+-------+
|            ...|            839|      1|
|           ....|            264|      2|
|           ....|            790|      3|
|           ....|            199|      4|
|           ....|            692|      5|
|           ....|            673|      6|
|           ....|            647|      7|
|           ....|            753|      8|
|           ....|            711|      9|
|           ....|           3090|    600|
|           ....|           3091|    601|
|           ....|           3088|    602|
|           ....|           3097|    603|
|           ....|           3100|    604|
|           ....|           3106|    605|
|           ....|           3102|    606|
|           ....|           3104|    607|
|           ....|           3110|    608|
|           ....|           3191|    609|
|           ....|           3171|    610|
...    
...    
|           ....|           2046|    500|
+---------------+---------------+-------+

If I query select * from shop_dim order by shop_id desc on Kylin web, the result is:

+---------------+---------------+-------+
|      SHOP_NAME|SHOP_COMPANY_ID|SHOP_ID|
+---------------+---------------+-------+
|           ....|           3184|    617|
|           ....|           3185|    616|
|           ....|           3175|    615|
|           ....|           3190|    614|
|           ....|           3183|    613|
|           ....|           3174|    612|
|           ....|           3186|    611|
|           ....|           3171|    610|
|           ....|           3191|    609|
|           ....|           3110|    608|
|           ....|           3104|    607|
|           ....|           3102|    606|
|           ....|           3106|    605|
|           ....|           3100|    604|
|           ....|           3097|    603|
|           ....|           3088|    602|
|           ....|           3091|    601|
|           ....|           3090|    600|
|           ....|           3076|    599|
|           ....|           3089|    598|
+---------------+---------------+-------+

Then I try to use spark:

object test { 
   val conf =new SparkConf()
   conf.setMaster("local")
   val spark=SparkSession.builder().config(conf).getOrCreate() 
   def main(args: Array[String]): Unit = { 
      val url="jdbc:kylin://someip:7070/fuyun"
      val tab_name = s"(select * from shop_dim order by shop_id desc) the_alias"   
      val df = spark.read.format("jdbc").option("url",url).option("user", "admin")
      .option("password", "KYLIN").option("driver", "org.apache.kylin.jdbc.Driver")
      .option("dbtable", tab_name).load()

      df.select("SHOP_ID").show()
      df.select("SHOP_NAME","SHOP_ID").show()
      df.select("SHOP_NAME","SHOP_COMPANY_ID","SHOP_ID").show()
      df.select("SHOP_NAME","SHOP_COMPANY_ID","SHOP_ID").orderBy(desc("SHOP_ID")).show()
  }
}

There are 4 show on the bottom,first two works as expected,the third not,the fourth is in right order.

the result of df.select("SHOP_ID").show()

+-------+
|SHOP_ID|
+-------+
|    617|
|    616|
|    615|
|    614|
|    613|
|    612|
|    611|
|    610|
|    609|
|    608|
|    607|
|    606|
|    605|
|    604|
|    603|
|    602|
|    601|
|    600|
|    599|
|    598|
+-------+

the order is correct.

the result of df.select("SHOP_NAME","SHOP_ID").show()

+---------------+-------+
|      SHOP_NAME|SHOP_ID|
+---------------+-------+
|           ....|    617|
|           ....|    616|
|           ....|    615|
|           ....|    614|
|           ....|    613|
|           ....|    612|
|           ....|    611|
|           ....|    610|
|           ....|    609|
|           ....|    608|
|           ....|    607|
|           ....|    606|
|           ....|    605|
|           ....|    604|
|           ....|    603|
|           ....|    602|
|           ....|    601|
|           ....|    600|
|           ....|    599|
|           ....|    598|
+---------------+-------+

It is all right too.

But the result of df.select("SHOP_NAME","SHOP_COMPANY_ID","SHOP_ID").show()

+---------------+---------------+-------+
|      SHOP_NAME|SHOP_COMPANY_ID|SHOP_ID|
+---------------+---------------+-------+
|           ....|            839|      1|
|           ....|            264|      2|
|           ....|            790|      3|
|           ....|            199|      4|
|           ....|            692|      5|
|           ....|            673|      6|
|           ....|            647|      7|
|           ....|            753|      8|
|           ....|            711|      9|
|           ....|           3090|    600|
|           ....|           3091|    601|
|           ....|           3088|    602|
|           ....|           3097|    603|
|           ....|           3100|    604|
|           ....|           3106|    605|
|           ....|           3102|    606|
|           ....|           3104|    607|
|           ....|           3110|    608|
|           ....|           3191|    609|
|           ....|           3171|    610|
+---------------+---------------+-------+

The order is not descending now,seems it is the same as the original order in kylin.

So I must add another .orderBy(desc("SHOP_ID")) to make it work:

df.select("SHOP_NAME","SHOP_COMPANY_ID","SHOP_ID").orderBy(desc("SHOP_ID")).show()

the result is:

+---------------+---------------+-------+
|      SHOP_NAME|SHOP_COMPANY_ID|SHOP_ID|
+---------------+---------------+-------+
|           ....|           3184|    617|
|           ....|           3185|    616|
|           ....|           3175|    615|
|           ....|           3190|    614|
|           ....|           3183|    613|
|           ....|           3174|    612|
|           ....|           3186|    611|
|           ....|           3171|    610|
|           ....|           3191|    609|
|           ....|           3110|    608|
|           ....|           3104|    607|
|           ....|           3102|    606|
|           ....|           3106|    605|
|           ....|           3100|    604|
|           ....|           3097|    603|
|           ....|           3088|    602|
|           ....|           3091|    601|
|           ....|           3090|    600|
|           ....|           3076|    599|
|           ....|           3089|    598|
+---------------+---------------+-------+

That is what I expect,but I am writing a common function which will be called by others,I want others just give a sql as a parameter,but not need to write the additional code .orderBy(desc("XXX"))

So why the order by in the kylin sql using spark does not work sometimes?

This is the expected behavior for any db system not only kylin. Spark will populate the data in parallel so from N threads, when merging the sub-parts there is no guarantee that this will be sorted. Therefore you must specify ordering explicitly after retrieving data from an external data source. — abiratsis, May 14 '19 at 11:21
@AlexandrosBiratsis In the above,I use `conf.setMaster("local")`,but if I use `conf.setMaster("local[1]")`,order is not guaranteed too,why?And If I put it to cluster and use `conf.setMaster("yarn")` ,can it be guaranteed? — hometown, May 15 '19 at 01:12
because I dont know if there is such a driver that guarantees the order by default! Have you seen such an implementation because I haven't :) Even if you know how the imported data will be stored from the client side still you can't guarantee that server will serve the data in a specific order therefore you must usually process/modify it afterwards. — abiratsis, May 15 '19 at 08:21

score 0 · Answer 1 · answered May 17 '19 at 06:44

I agree with @Alexandros Biratsis

The code df.select("SHOP_NAME","SHOP_COMPANY_ID","SHOP_ID").show() will query kylin with sql like

SELECT "SHOP_NAME","SHOP_COMPANY_ID","SHOP_ID" FROM (select * from shop_dim order by COUNTRY desc) the_alias

You can try run this sql on Kylin's web, the result should be ordered.

sql order by does not work when spark connect to kylin via jdbc

1 Answers1