2

We use apache spark to query data from apache kylin via jdbc,we write a sql with a order by clause,but sometimes it does not work.

Briefly,there is a table called shop_dim in kylin,a field SHOP_ID is not in descending order,I use spark jdbc to query sql select * from shop_dim order by shop_id desc to get a dataframe,If I select all the columns,order by not work,but if I select only a few fields,it works.

By the way,I have a guess:if the kylin table is dimension,this problem occurs,if the table is measure,it is ok.

It is the details below:

When I run query select * from shop_dim on the kylin page of web browser (http://someip:7070/kylin/query#query_content_results) The result is:

+---------------+---------------+-------+
|      SHOP_NAME|SHOP_COMPANY_ID|SHOP_ID|
+---------------+---------------+-------+
|            ...|            839|      1|
|           ....|            264|      2|
|           ....|            790|      3|
|           ....|            199|      4|
|           ....|            692|      5|
|           ....|            673|      6|
|           ....|            647|      7|
|           ....|            753|      8|
|           ....|            711|      9|
|           ....|           3090|    600|
|           ....|           3091|    601|
|           ....|           3088|    602|
|           ....|           3097|    603|
|           ....|           3100|    604|
|           ....|           3106|    605|
|           ....|           3102|    606|
|           ....|           3104|    607|
|           ....|           3110|    608|
|           ....|           3191|    609|
|           ....|           3171|    610|
...    
...    
|           ....|           2046|    500|
+---------------+---------------+-------+

If I query select * from shop_dim order by shop_id desc on Kylin web, the result is:

+---------------+---------------+-------+
|      SHOP_NAME|SHOP_COMPANY_ID|SHOP_ID|
+---------------+---------------+-------+
|           ....|           3184|    617|
|           ....|           3185|    616|
|           ....|           3175|    615|
|           ....|           3190|    614|
|           ....|           3183|    613|
|           ....|           3174|    612|
|           ....|           3186|    611|
|           ....|           3171|    610|
|           ....|           3191|    609|
|           ....|           3110|    608|
|           ....|           3104|    607|
|           ....|           3102|    606|
|           ....|           3106|    605|
|           ....|           3100|    604|
|           ....|           3097|    603|
|           ....|           3088|    602|
|           ....|           3091|    601|
|           ....|           3090|    600|
|           ....|           3076|    599|
|           ....|           3089|    598|
+---------------+---------------+-------+

Then I try to use spark:

object test { 
   val conf =new SparkConf()
   conf.setMaster("local")
   val spark=SparkSession.builder().config(conf).getOrCreate() 
   def main(args: Array[String]): Unit = { 
      val url="jdbc:kylin://someip:7070/fuyun"
      val tab_name = s"(select * from shop_dim order by shop_id desc) the_alias"   
      val df = spark.read.format("jdbc").option("url",url).option("user", "admin")
      .option("password", "KYLIN").option("driver", "org.apache.kylin.jdbc.Driver")
      .option("dbtable", tab_name).load()

      df.select("SHOP_ID").show()
      df.select("SHOP_NAME","SHOP_ID").show()
      df.select("SHOP_NAME","SHOP_COMPANY_ID","SHOP_ID").show()
      df.select("SHOP_NAME","SHOP_COMPANY_ID","SHOP_ID").orderBy(desc("SHOP_ID")).show()
  }
}

There are 4 show on the bottom,first two works as expected,the third not,the fourth is in right order.

the result of df.select("SHOP_ID").show()

+-------+
|SHOP_ID|
+-------+
|    617|
|    616|
|    615|
|    614|
|    613|
|    612|
|    611|
|    610|
|    609|
|    608|
|    607|
|    606|
|    605|
|    604|
|    603|
|    602|
|    601|
|    600|
|    599|
|    598|
+-------+

the order is correct.

the result of df.select("SHOP_NAME","SHOP_ID").show()

+---------------+-------+
|      SHOP_NAME|SHOP_ID|
+---------------+-------+
|           ....|    617|
|           ....|    616|
|           ....|    615|
|           ....|    614|
|           ....|    613|
|           ....|    612|
|           ....|    611|
|           ....|    610|
|           ....|    609|
|           ....|    608|
|           ....|    607|
|           ....|    606|
|           ....|    605|
|           ....|    604|
|           ....|    603|
|           ....|    602|
|           ....|    601|
|           ....|    600|
|           ....|    599|
|           ....|    598|
+---------------+-------+

It is all right too.

But the result of df.select("SHOP_NAME","SHOP_COMPANY_ID","SHOP_ID").show()

+---------------+---------------+-------+
|      SHOP_NAME|SHOP_COMPANY_ID|SHOP_ID|
+---------------+---------------+-------+
|           ....|            839|      1|
|           ....|            264|      2|
|           ....|            790|      3|
|           ....|            199|      4|
|           ....|            692|      5|
|           ....|            673|      6|
|           ....|            647|      7|
|           ....|            753|      8|
|           ....|            711|      9|
|           ....|           3090|    600|
|           ....|           3091|    601|
|           ....|           3088|    602|
|           ....|           3097|    603|
|           ....|           3100|    604|
|           ....|           3106|    605|
|           ....|           3102|    606|
|           ....|           3104|    607|
|           ....|           3110|    608|
|           ....|           3191|    609|
|           ....|           3171|    610|
+---------------+---------------+-------+

The order is not descending now,seems it is the same as the original order in kylin.

So I must add another .orderBy(desc("SHOP_ID")) to make it work:

df.select("SHOP_NAME","SHOP_COMPANY_ID","SHOP_ID").orderBy(desc("SHOP_ID")).show()

the result is:

+---------------+---------------+-------+
|      SHOP_NAME|SHOP_COMPANY_ID|SHOP_ID|
+---------------+---------------+-------+
|           ....|           3184|    617|
|           ....|           3185|    616|
|           ....|           3175|    615|
|           ....|           3190|    614|
|           ....|           3183|    613|
|           ....|           3174|    612|
|           ....|           3186|    611|
|           ....|           3171|    610|
|           ....|           3191|    609|
|           ....|           3110|    608|
|           ....|           3104|    607|
|           ....|           3102|    606|
|           ....|           3106|    605|
|           ....|           3100|    604|
|           ....|           3097|    603|
|           ....|           3088|    602|
|           ....|           3091|    601|
|           ....|           3090|    600|
|           ....|           3076|    599|
|           ....|           3089|    598|
+---------------+---------------+-------+

That is what I expect,but I am writing a common function which will be called by others,I want others just give a sql as a parameter,but not need to write the additional code .orderBy(desc("XXX"))

So why the order by in the kylin sql using spark does not work sometimes?

hometown
  • 279
  • 2
  • 17
  • This is the expected behavior for any db system not only kylin. Spark will populate the data in parallel so from N threads, when merging the sub-parts there is no guarantee that this will be sorted. Therefore you must specify ordering explicitly after retrieving data from an external data source. – abiratsis May 14 '19 at 11:21
  • @AlexandrosBiratsis In the above,I use `conf.setMaster("local")`,but if I use `conf.setMaster("local[1]")`,order is not guaranteed too,why?And If I put it to cluster and use `conf.setMaster("yarn")` ,can it be guaranteed? – hometown May 15 '19 at 01:12
  • because I dont know if there is such a driver that guarantees the order by default! Have you seen such an implementation because I haven't :) Even if you know how the imported data will be stored from the client side still you can't guarantee that server will serve the data in a specific order therefore you must usually process/modify it afterwards. – abiratsis May 15 '19 at 08:21

1 Answers1

0

I agree with @Alexandros Biratsis

The code df.select("SHOP_NAME","SHOP_COMPANY_ID","SHOP_ID").show() will query kylin with sql like

SELECT "SHOP_NAME","SHOP_COMPANY_ID","SHOP_ID" FROM (select * from shop_dim order by COUNTRY desc) the_alias

You can try run this sql on Kylin's web, the result should be ordered.