We use apache spark to query data from apache kylin via jdbc,we write a sql with a order by
clause,but sometimes it does not work.
Briefly,there is a table called shop_dim
in kylin,a field SHOP_ID
is not in descending order,I use spark jdbc to query sql select * from shop_dim order by shop_id desc
to get a dataframe,If I select all the columns,order by
not work,but if I select only a few fields,it works.
By the way,I have a guess:if the kylin table is dimension,this problem occurs,if the table is measure,it is ok.
It is the details below:
When I run query select * from shop_dim
on the kylin page of web browser (http://someip:7070/kylin/query#query_content_results)
The result is:
+---------------+---------------+-------+
| SHOP_NAME|SHOP_COMPANY_ID|SHOP_ID|
+---------------+---------------+-------+
| ...| 839| 1|
| ....| 264| 2|
| ....| 790| 3|
| ....| 199| 4|
| ....| 692| 5|
| ....| 673| 6|
| ....| 647| 7|
| ....| 753| 8|
| ....| 711| 9|
| ....| 3090| 600|
| ....| 3091| 601|
| ....| 3088| 602|
| ....| 3097| 603|
| ....| 3100| 604|
| ....| 3106| 605|
| ....| 3102| 606|
| ....| 3104| 607|
| ....| 3110| 608|
| ....| 3191| 609|
| ....| 3171| 610|
...
...
| ....| 2046| 500|
+---------------+---------------+-------+
If I query select * from shop_dim order by shop_id desc
on Kylin web,
the result is:
+---------------+---------------+-------+
| SHOP_NAME|SHOP_COMPANY_ID|SHOP_ID|
+---------------+---------------+-------+
| ....| 3184| 617|
| ....| 3185| 616|
| ....| 3175| 615|
| ....| 3190| 614|
| ....| 3183| 613|
| ....| 3174| 612|
| ....| 3186| 611|
| ....| 3171| 610|
| ....| 3191| 609|
| ....| 3110| 608|
| ....| 3104| 607|
| ....| 3102| 606|
| ....| 3106| 605|
| ....| 3100| 604|
| ....| 3097| 603|
| ....| 3088| 602|
| ....| 3091| 601|
| ....| 3090| 600|
| ....| 3076| 599|
| ....| 3089| 598|
+---------------+---------------+-------+
Then I try to use spark:
object test {
val conf =new SparkConf()
conf.setMaster("local")
val spark=SparkSession.builder().config(conf).getOrCreate()
def main(args: Array[String]): Unit = {
val url="jdbc:kylin://someip:7070/fuyun"
val tab_name = s"(select * from shop_dim order by shop_id desc) the_alias"
val df = spark.read.format("jdbc").option("url",url).option("user", "admin")
.option("password", "KYLIN").option("driver", "org.apache.kylin.jdbc.Driver")
.option("dbtable", tab_name).load()
df.select("SHOP_ID").show()
df.select("SHOP_NAME","SHOP_ID").show()
df.select("SHOP_NAME","SHOP_COMPANY_ID","SHOP_ID").show()
df.select("SHOP_NAME","SHOP_COMPANY_ID","SHOP_ID").orderBy(desc("SHOP_ID")).show()
}
}
There are 4 show
on the bottom,first two works as expected,the third not,the fourth is in right order.
the result of df.select("SHOP_ID").show()
+-------+
|SHOP_ID|
+-------+
| 617|
| 616|
| 615|
| 614|
| 613|
| 612|
| 611|
| 610|
| 609|
| 608|
| 607|
| 606|
| 605|
| 604|
| 603|
| 602|
| 601|
| 600|
| 599|
| 598|
+-------+
the order is correct.
the result of df.select("SHOP_NAME","SHOP_ID").show()
+---------------+-------+
| SHOP_NAME|SHOP_ID|
+---------------+-------+
| ....| 617|
| ....| 616|
| ....| 615|
| ....| 614|
| ....| 613|
| ....| 612|
| ....| 611|
| ....| 610|
| ....| 609|
| ....| 608|
| ....| 607|
| ....| 606|
| ....| 605|
| ....| 604|
| ....| 603|
| ....| 602|
| ....| 601|
| ....| 600|
| ....| 599|
| ....| 598|
+---------------+-------+
It is all right too.
But the result of df.select("SHOP_NAME","SHOP_COMPANY_ID","SHOP_ID").show()
+---------------+---------------+-------+
| SHOP_NAME|SHOP_COMPANY_ID|SHOP_ID|
+---------------+---------------+-------+
| ....| 839| 1|
| ....| 264| 2|
| ....| 790| 3|
| ....| 199| 4|
| ....| 692| 5|
| ....| 673| 6|
| ....| 647| 7|
| ....| 753| 8|
| ....| 711| 9|
| ....| 3090| 600|
| ....| 3091| 601|
| ....| 3088| 602|
| ....| 3097| 603|
| ....| 3100| 604|
| ....| 3106| 605|
| ....| 3102| 606|
| ....| 3104| 607|
| ....| 3110| 608|
| ....| 3191| 609|
| ....| 3171| 610|
+---------------+---------------+-------+
The order is not descending now,seems it is the same as the original order in kylin.
So I must add another .orderBy(desc("SHOP_ID"))
to make it work:
df.select("SHOP_NAME","SHOP_COMPANY_ID","SHOP_ID").orderBy(desc("SHOP_ID")).show()
the result is:
+---------------+---------------+-------+
| SHOP_NAME|SHOP_COMPANY_ID|SHOP_ID|
+---------------+---------------+-------+
| ....| 3184| 617|
| ....| 3185| 616|
| ....| 3175| 615|
| ....| 3190| 614|
| ....| 3183| 613|
| ....| 3174| 612|
| ....| 3186| 611|
| ....| 3171| 610|
| ....| 3191| 609|
| ....| 3110| 608|
| ....| 3104| 607|
| ....| 3102| 606|
| ....| 3106| 605|
| ....| 3100| 604|
| ....| 3097| 603|
| ....| 3088| 602|
| ....| 3091| 601|
| ....| 3090| 600|
| ....| 3076| 599|
| ....| 3089| 598|
+---------------+---------------+-------+
That is what I expect,but I am writing a common function which will be called by others,I want others just give a sql as a parameter,but not need to write the additional code .orderBy(desc("XXX"))
So why the order by in the kylin sql using spark does not work sometimes?