
I have a Hive table partitioned by ds, where ds is a string like 2017-11-07. Since strings are comparable, I want to select the latest partition of the data from Hive using Spark SQL, so I wrote this code:

Dataset<Row> ds = sparkSession.sql("select max(ds) from admin_zpw123_user_meta");

String s = ds.first().getString(0);

sparkSession.sql("select * from admin_zpw123_user_meta where ds="+s).show();

I can print the string s, which is 2017-11-07, but I don't get any output from the third statement. I want to know why, and whether there is a more elegant way to do this.

– zpwpal

3 Answers


You need single quotes around the 2017-11-07 string when you splice it into the SQL statement. Without them, ds=2017-11-07 is parsed as the arithmetic expression 2017 minus 11 minus 7, so the comparison never matches any row. You can add the quotes to the query like this:

sparkSession.sql("select * from admin_zpw123_user_meta where ds='" + s + "'").show();
– Shaido

I just added single quotes ('') around 2017-11-07 and it works, but it's still not that elegant.
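
One way to make it more elegant is to skip the string concatenation entirely and build the filter through the DataFrame API, which quotes the value for you. A minimal sketch, assuming s still holds the max(ds) value from the first query:

import static org.apache.spark.sql.functions.col;

// equalTo compares the ds column to the string value; no manual quoting needed
sparkSession.table("admin_zpw123_user_meta").filter(col("ds").equalTo(s)).show();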

– zpwpal

Actions are expensive in Spark, and you have an unnecessary one here:

String s = ds.first().getString(0);

To avoid it, you can filter for the latest partition date in a single query:

sparkSession.sql("select * from admin_zpw123_user_meta where ds in (select max(distinct ds) from admin_zpw123_user_meta)").show();
– Henrique Florencio
This query will work, but under the hood it will still scan all the available partitions, which you can see if you check the plan for the query above. Something like the approach below will scan only one partition; we can verify that in the plan as well. https://stackoverflow.com/questions/55053218/pyspark-getting-latest-partition-from-hive-partitioned-column-logic/57440760#57440760 – vikrant rana Aug 14 '19 at 13:33
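
For reference, a minimal sketch of the metadata-only approach the linked answer describes, assuming ds is the table's only partition column and that the SHOW PARTITIONS output column is named partition (worth verifying on your Spark version). The first() action is cheap here because SHOW PARTITIONS only reads the metastore, and the final query filters on a literal, so only one partition is scanned:

// Each row is a string like "ds=2017-11-07"; no table data is read by this query
Dataset<Row> parts = sparkSession.sql("show partitions admin_zpw123_user_meta");
// Take the lexicographically largest partition and strip the "ds=" prefix
String latest = parts.orderBy(parts.col("partition").desc()).first().getString(0).replace("ds=", "");
sparkSession.sql("select * from admin_zpw123_user_meta where ds='" + latest + "'").show();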