14

I want to validate whether a Spark SQL query is syntactically correct, without actually running the query on the cluster.

The actual use case is that I am developing a user interface that accepts a Spark SQL query from the user, and I should be able to verify whether the provided query is syntactically correct. It would also be ideal if, after parsing the query, I could give recommendations on the query with respect to Spark best practices.

Def_Os
Manish Mehra
  • How about executing `val df = spark.sql(query)`? It will be parsed, but the query will not be executed. If there is an exception, then the syntax is most probably wrong... – MaxU - stand with Ukraine Oct 27 '17 at 12:51

4 Answers

7

SparkSqlParser

Spark SQL uses SparkSqlParser as the parser for Spark SQL expressions.

You can access SparkSqlParser using SparkSession (and SessionState) as follows:

val spark: SparkSession = ...
val parser = spark.sessionState.sqlParser

scala> parser.parseExpression("select * from table")
res1: org.apache.spark.sql.catalyst.expressions.Expression = ('select * 'from) AS table#0
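
Note that parseExpression parses a single SQL expression, which is why the statement above is silently misread as an alias instead of being rejected. For a complete SQL statement, the same parser interface exposes parsePlan, which does throw a ParseException on malformed input. A minimal sketch, reusing the parser value from above:

import scala.util.Try

// parsePlan builds an unresolved logical plan: the syntax is checked,
// but no catalog lookup happens and nothing is executed.
def isSyntaxValid(sql: String): Boolean =
  Try(parser.parsePlan(sql)).isSuccess

isSyntaxValid("select id from t where x = 1")  // true
isSyntaxValid("hello world")                   // false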

TIP: Enable INFO logging level for org.apache.spark.sql.execution.SparkSqlParser logger to see what happens inside.

SparkSession.sql Method

That alone won't give you the most bullet-proof shield against incorrect SQL expressions, and I think the sql method is a better fit.

sql(sqlText: String): DataFrame Executes a SQL query using Spark, returning the result as a DataFrame. The dialect that is used for SQL parsing can be configured with 'spark.sql.dialect'.

See both in action below.

scala> parser.parseExpression("hello world")
res5: org.apache.spark.sql.catalyst.expressions.Expression = 'hello AS world#2

scala> spark.sql("hello world")
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'hello' expecting {'(', 'SELECT', 'FROM', 'ADD', 'DESC', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'INSERT', 'DELETE', 'DESCRIBE', 'EXPLAIN', 'SHOW', 'USE', 'DROP', 'ALTER', 'MAP', 'SET', 'RESET', 'START', 'COMMIT', 'ROLLBACK', 'REDUCE', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'DFS', 'TRUNCATE', 'ANALYZE', 'LIST', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'EXPORT', 'IMPORT', 'LOAD'}(line 1, pos 0)

== SQL ==
hello world
^^^

  at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:217)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:114)
  at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:68)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:638)
  ... 49 elided
Jacek Laskowski
  • So it is not possible to validate a query without executing it? – Ayoub Omari Mar 12 '20 at 09:25
  • It is. See the part about SparkSqlParser. A more sophisticated validation would require logical validation (i.e. checking whether tables, columns and types match). I think it's possible using QueryExecution and a few logical operators. What validation do you have in mind? – Jacek Laskowski Mar 12 '20 at 10:25
  • For my case, I am thinking especially of the syntactic validation of the query (which I guess happens in an early stage of Catalyst), and also of checking whether the columns selected by the query exist in the view (which I think happens in a subsequent stage). – Ayoub Omari Mar 12 '20 at 10:39
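
Following up on that comment thread, here is a hedged sketch of the two-stage check (assuming an active spark session): parsePlan, shown earlier, covers the purely syntactic stage, while building the DataFrame with spark.sql additionally runs the Catalyst analyzer, which resolves tables and columns against the catalog. No job is executed until an action is called.

import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.parser.ParseException

// Distinguishes the two Catalyst stages mentioned in the comments:
// parse errors (syntax) vs. analysis errors (unknown tables/columns).
// ParseException extends AnalysisException, so it must be matched first.
def diagnose(sql: String): String =
  try {
    spark.sql(sql) // analyzed eagerly, but nothing runs without an action
    "OK"
  } catch {
    case ex: ParseException    => "Syntax error: " + ex.getMessage
    case ex: AnalysisException => "Analysis error: " + ex.getMessage
  }
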
3

I found a few issues using SparkSqlParser, and it didn't support my use case. For instance, the valid Spark SQL query below threw an exception:

val parser = spark.sessionState.sqlParser
val query = "select * from a.b where datamonth = '202203' limit 10"
parser.parseExpression(query)

Exception

org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input '.' expecting {<EOF>, '-'}(line 1, pos 15)

== SQL ==
select * from a.b where datamonth = '202203' limit 10
---------------^^^

Solution

So, the solution below, based on the query execution's explainString, helped me with both syntactic and logical correctness.

import org.apache.spark.sql.execution.ExplainMode

import scala.util.{Failure, Success, Try}

// explainString triggers parsing, analysis and physical planning,
// but does not run the query itself.
def validateQuery(query: String): String =
  Try(spark.sql(query).queryExecution.explainString(ExplainMode.fromString("simple"))) match {
    case Success(plan) => "SUCCESS" + plan
    case Failure(ex)   => "FAILURE => " + ex.getMessage
  }

Test 1 - Valid query

validQuery: String = select * from bi.booking where datamonth = '202203' limit 10

"SUCCESS== Physical Plan ==
CollectLimit 10
+- FileScan parquet bi.booking[booking_id#13396L,page_type_id#13397L,platform_id#13398L,page_type_platform#13399,... 5 more fields] Batched: false, DataFilters: [], Format: Parquet, Loc...

Test 2 - Syntactically invalid query

invalidQuery: String = select * from bi.booking datamonth = '202203' limit 10

"FAILURE =>
mismatched input '=' expecting {<EOF>, ';'}(line 1, pos 47)

== SQL ==
select * from bi.booking datamonth = '202203' limit 10
-------------------------------^^^
"

Test 3 - Query with non existing table

noTableQuery: String = select * from a.b where datamonth = '202203' limit 10

"FAILURE => Table or view not found: a.b; line 1 pos 14;
'GlobalLimit 10
+- 'LocalLimit 10
   +- 'Project [*]
      +- 'Filter ('datamonth = 202203)
         +- 'UnresolvedRelation [a, b], [], false
Kondasamy Jayaraman
  • Hi, is there a way to skip the table/column existence validation? For example, the first statement creates a table and the second selects from that new table. If we validate the second statement (without running the first), it will fail because the table does not exist, even though the query is syntactically correct. – Varun Nov 15 '22 at 05:46
  • I don't think it's possible. Schema validation will still fail. – Kondasamy Jayaraman Nov 16 '22 at 08:35
2

Following @JacekLaskowski's answer, I found that SparkSqlParser gave me all sorts of errors that were not really syntax errors.

I therefore agree with him and suggest simply passing the statement to SparkSession.sql, which works fine. This is what my method looks like:

  import org.apache.spark.sql.{AnalysisException, SparkSession}
  import org.apache.spark.sql.catalyst.parser.ParseException

  /**
   * Validates a Spark SQL statement by trying to execute it and checking
   * if there are no syntax-related exceptions.
   */
  def validate(sqlStatement: String): Unit = {
    val spark = SparkSession.builder
      .master("local")
      .getOrCreate()
    try {
      spark.sql(sqlStatement)
    } catch {
      // ParseException: the statement could not be parsed at all.
      case ex: ParseException => throw new MyCustomException("Invalid Spark SQL", ex)
      // AnalysisException: parsing succeeded but analysis failed (e.g. an
      // unknown table or column), so the syntax itself was correct.
      case _: AnalysisException => // Syntax was correct
    }
  }
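
For example (with MyCustomException standing in for whatever exception type your codebase defines):

  validate("select * from")          // throws MyCustomException (ParseException inside)
  validate("select x from no_table") // passes: AnalysisException, so the syntax was fine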
Def_Os
2

For PySpark, you can use the following:

from pyspark.sql import SparkSession
from pyspark.sql.utils import ParseException


spark_session = SparkSession \
    .builder \
    .master("local[1]") \
    .appName('test') \
    .getOrCreate()

query_string = "select id from t where x = 1"  # placeholder: the query to validate

try:
    # parseExpression validates a single SQL expression; for a full
    # statement, parser.parsePlan(query_string) can be used instead.
    parser = spark_session._jsparkSession.sessionState().sqlParser()
    parser.parseExpression(query_string)
    print('Validation successful')
except ParseException as e:
    pass  # do something with the parse error in e
Giorgos Myrianthous