
I have a Hive table partitioned by ds, where ds is a string like 2017-11-07. Since strings are comparable, I want to select the latest partition of the data from Hive using Spark SQL, so I wrote this code:

Dataset<Row> ds = sparkSession.sql("select max(ds) from admin_zpw123_user_meta");

String s = ds.first().getString(0);

sparkSession.sql("select * from admin_zpw123_user_meta where ds="+s).show();

I can print the string s, which is 2017-11-07, but I don't get any output from the third statement. I want to know why, and whether there is a more elegant way to do this.

– zpwpal

3 Answers


You need single quotes around the 2017-11-07 string when you splice it into the SQL statement. Without them, ds=2017-11-07 is parsed as the arithmetic expression 2017 minus 11 minus 7, so the comparison never matches any row. You can add the quotes to the query like this:

sparkSession.sql("select * from admin_zpw123_user_meta where ds='" + s + "'").show();
– Shaido

I just added single quotes ('') around 2017-11-07 and it works, but it's still not that elegant.
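
One way to make it more elegant is to skip the string concatenation entirely and build the filter through the DataFrame API, which quotes the value for you. A minimal sketch, assuming s still holds the max(ds) value from the first query:

import static org.apache.spark.sql.functions.col;

// equalTo compares the ds column to the string value; no manual quoting needed
sparkSession.table("admin_zpw123_user_meta").filter(col("ds").equalTo(s)).show();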

– zpwpal

Actions are expensive in Spark, and you have an unnecessary one here:

String s = ds.first().getString(0);

To avoid it, you can filter for the latest partition date in a single query:

sparkSession.sql("select * from admin_zpw123_user_meta where ds in (select max(distinct ds) from admin_zpw123_user_meta)").show();
– Henrique Florencio
This query will work, but under the hood it will still scan all the available partitions, which you can see if you check the plan for the query above. Something like the approach below will scan only one partition; we can verify that in the plan as well. https://stackoverflow.com/questions/55053218/pyspark-getting-latest-partition-from-hive-partitioned-column-logic/57440760#57440760 – vikrant rana Aug 14 '19 at 13:33
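
For reference, a minimal sketch of the metadata-only approach the linked answer describes, assuming ds is the table's only partition column and that the SHOW PARTITIONS output column is named partition (worth verifying on your Spark version). The first() action is cheap here because SHOW PARTITIONS only reads the metastore, and the final query filters on a literal, so only one partition is scanned:

// Each row is a string like "ds=2017-11-07"; no table data is read by this query
Dataset<Row> parts = sparkSession.sql("show partitions admin_zpw123_user_meta");
// Take the lexicographically largest partition and strip the "ds=" prefix
String latest = parts.orderBy(parts.col("partition").desc()).first().getString(0).replace("ds=", "");
sparkSession.sql("select * from admin_zpw123_user_meta where ds='" + latest + "'").show();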