9

Is there a way to validate the syntax of a raw Spark SQL query?

For example, I would like to know if there is an isValid API call that Spark provides?

val query = "select * from table"
if(isValid(query)) {
    sparkSession.sql(query) 
} else {
    log.error("Invalid Syntax")
}

I tried the following

val query = "select * morf table" // Invalid query
val parser = spark.sessionState.sqlParser
try {
    parser.parseExpression(query)
} catch {
    case ex: ParseException =>
        throw new Exception(ex) // Exception not getting thrown
}
val df = sparkSession.sql(query) // Exception gets thrown here
df.writeStream.format("console").start()

Question: parser.parseExpression is not catching the invalid syntax before I hit sparkSession.sql. In other words, it is not being helpful in the above code. Any reason? My whole goal is to catch syntax errors before I pass the query on to sparkSession.sql

user1870400
  • 6,028
  • 13
  • 54
  • 115
  • You are basically asking how to catch an exception. Look into `scala.util.Try(spark.sql(query)) match { }` probably? – philantrovert Feb 19 '18 at 11:28
  • I am more looking to catch syntax error before even sending the query to spark and waiting for the exception to happen. – user1870400 Feb 19 '18 at 11:31
  • Good coding followed by equally good testing should obviate such requirements. – thebluephantom Aug 07 '19 at 18:48
  • 3
    @thebluephantom you are missing the whole point. There should be a way to syntactically validate a query before executing it. That way we don't throw some garbage to the execution engine but if Spark SQL Parser cannot provide that then there is no "Good coding" that can prevent it. – user1870400 Aug 07 '19 at 18:53
  • Agree to disagree. – thebluephantom Aug 07 '19 at 18:54
  • 3
    @thebluephantom sometimes executor application is just an interface which allows user to run arbitrary spark queries. If spark cluster remains busy, sometimes query executor gets error 1-2 hours after submission of query notifying that there was an error in syntax. I am using below suggested approach, though may not be best way, it gives validation result pretty quick to end user. – tarunkumar Aug 07 '19 at 22:15

3 Answers

6

Use Spark SQL Parser:

val parser = spark.sessionState.sqlParser
parser.parseExpression("select * from table")
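
Note that `parseExpression` parses a single SQL expression, not a full statement; for a complete query, `parsePlan` is likely the more appropriate entry point. A minimal sketch, assuming a live `SparkSession` named `spark`:

```scala
import org.apache.spark.sql.catalyst.parser.ParseException

val parser = spark.sessionState.sqlParser

// parsePlan parses a full SQL statement and throws ParseException on bad syntax
try {
  parser.parsePlan("select * morf table") // invalid keyword "morf"
} catch {
  case e: ParseException => println(s"Invalid syntax: ${e.getMessage}")
}
```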
Yehor Krivokon
  • 837
  • 5
  • 17
  • Hi! I tried this but it didn't quite work. I've edited the question where I included what I tried. can you please help? – user1870400 Feb 22 '18 at 13:09
3

After much research without finding an answer, I found a way to do it by reading some Spark SQL code. These lines of code do the job:

val query = "select * from table"
val logicalPlan = spark.sessionState.sqlParser.parsePlan(query) //parse the query and build the AST
val queryExecution = spark.sessionState.executePlan(logicalPlan) // create plans of phases (Analysis, logical optimization plan, physical plan, ...) without any execution
queryExecution.assertAnalyzed() //trigger the Analysis phase

Note that it is in the analysis phase that table resolution, column name resolution, etc. happen. So this basically performs both a syntax and a context analysis.
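
The lines above can be wrapped into a small validation helper. A sketch (the helper name `validateQuery` is mine, not part of any Spark API):

```scala
import scala.util.{Try, Success, Failure}
import org.apache.spark.sql.SparkSession

// Hypothetical helper: returns None if the query parses and analyzes cleanly,
// or Some(errorMessage) if either phase fails. Nothing is executed.
def validateQuery(spark: SparkSession, query: String): Option[String] = {
  Try {
    val logicalPlan = spark.sessionState.sqlParser.parsePlan(query) // syntax check
    spark.sessionState.executePlan(logicalPlan).assertAnalyzed()    // semantic check
  } match {
    case Success(_)  => None
    case Failure(ex) => Some(ex.getMessage)
  }
}
```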

Ayoub Omari
  • 806
  • 1
  • 7
  • 24
  • Does this do only syntax validation? Can I run this with spark session in a local machine even if hive metastore is not available? – r4ravi2008 Sep 14 '21 at 22:00
2

Though it's late, this may be helpful for someone else. Using

spark.sessionState.sqlParser

doesn't seem very useful, as it doesn't complain even for a query string like 'Hello World'
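
To illustrate (a hedged sketch, assuming a live `SparkSession` named `spark`): `parseExpression` accepts 'Hello World' without complaint, likely because it reads it as an identifier with an alias, whereas `parsePlan` rejects it as a statement:

```scala
// parseExpression accepts this happily (likely parsed as `Hello AS World`)
val expr = spark.sessionState.sqlParser.parseExpression("Hello World")
println(expr) // an unresolved expression, no ParseException

// parsePlan, by contrast, throws ParseException for the same string,
// since "Hello World" is not a valid SQL statement
spark.sessionState.sqlParser.parsePlan("Hello World")
```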

It is better to use

sparkSession.sql(query)

which actually validates the query syntax. Since Spark primarily works lazily, ideally the code below should not perform any operation on the actual dataset.

  def validateSql(query: String, session: SparkSession): Boolean = {
    try {
      val result = session.sql(query)
      println(s"No Syntax Error [$result] Query [$query]")
      true
    } catch {
      case e: Throwable =>
        println(s"Query has syntax error. Error [${e.getMessage}] Query [$query]")
        false
    }
  }

If you look at the Spark logs for the above code, you will see something similar to:

19/08/07 11:17:09 INFO SparkSqlParser: Parsing command: select * from XYZ limit 5
19/08/07 11:17:10 INFO CatalystSqlParser: Parsing command: bigint
19/08/07 11:17:10 INFO CatalystSqlParser: Parsing command: bigint
19/08/07 11:17:10 INFO CatalystSqlParser: Parsing command: bigint
19/08/07 11:17:10 INFO CatalystSqlParser: Parsing command: bigint
No Syntax Error [[A: bigint, BC: bigint ... 2 more fields]] Query [select * from XYZ limit 5]

Details can be found in another SO question: How to validate SQL expression (in user frontend) without executing it?

However, since sparkSession.sql(query) needs table metadata to validate table columns, it's best to point directly to the Hive metastore. The following are references on how to do the same programmatically.

How to connect to a Hive metastore programmatically in SparkSQL?

How to create pom.xml for maven using SparkSql and Hive?
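
As a sketch of what such a programmatic metastore connection can look like (the metastore URI and app name here are placeholders, not values from either linked question):

```scala
import org.apache.spark.sql.SparkSession

// Build a session with Hive support so sql() can resolve tables and
// columns against the Hive metastore during analysis
val spark = SparkSession.builder()
  .appName("sql-validator") // placeholder name
  .config("hive.metastore.uris", "thrift://metastore-host:9083") // placeholder URI
  .enableHiveSupport()
  .getOrCreate()
```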

tarunkumar
  • 861
  • 2
  • 15
  • 30
  • 2
    `sparkSession.sql(query)` not only parses but executes the query. The question is how to syntactically validate a query without executing it? – user1870400 Aug 07 '19 at 18:46
  • hmmm maybe I am missing something; if one doesn't perform .show on df, no call to select data should be made, so all calls may be limited to metadata only. Since I have to perform something similar, curious to know if you have found any better solution, as this question was raised almost 1+ year back. – tarunkumar Aug 07 '19 at 19:00
  • @user1870400 sorry missed to mention your name in previous comment, did you got any better solution as i need to implement something very similar of mentioned use case. – tarunkumar Aug 07 '19 at 22:16
  • No I don't see a way – user1870400 Aug 08 '19 at 09:33