
Can a SQL DDL statement be parsed into a simple JSON schema file, as shown below, without using any external tools, just Scala/Python/shell scripting?

CREATE TABLE TEMP (
      ID INT,
      NAME STRING)

[
  {
    "tableName": "temp",
    "columns": [
      {
        "columnname": "id",
        "datatype": "int"
      },
      {
        "columnname": "name",
        "datatype": "string"
      }
    ]
  }
]
Rajesh

4 Answers


You can build a JSON-form string from your DDL using the logic below (Scala code). Once the string is built, it is converted into a DataFrame, and that DataFrame is then saved to HDFS/Amazon S3 as a JSON file using the DataFrame's built-in write.json API.

import spark.implicits._

val createSql = "CREATE TABLE TEMP (ID INT, NAME STRING)"

// Start the JSON with the table name (the third token of the DDL).
var jsonString = """[{"tableName":"""" + createSql.split(" ")(2).toLowerCase + "\",\"columns\":["

// Take the text between the parentheses, split it into column definitions,
// and append one JSON object per column.
createSql.split("\\(")(1).split("\\)")(0).split(",").foreach { r =>
  jsonString += "{\"columnname\": \"" + r.trim.split(" ")(0).toLowerCase +
    "\",\"datatype\": \"" + r.trim.split(" ")(1).toLowerCase + "\"},"
}

// Drop the trailing comma and close the array and object.
jsonString = jsonString.patch(jsonString.lastIndexOf(','), "", 1) + "]}]"

// Read the string into a DataFrame and write it out as a JSON file.
val df = spark.read.json(Seq(jsonString).toDS)
df.coalesce(1).write.json("<targetlocation>")
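
For the sample statement above, the intermediate jsonString should work out to:

    [{"tableName":"temp","columns":[{"columnname": "id","datatype": "int"},{"columnname": "name","datatype": "string"}]}]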

Please let me know if you have any questions.

Sarath Subramanian
  • It worked perfectly for me. What if I have the create SQL in a file with multiple DDL statements, and there are some comments in between that need to be skipped in the same file? ex: `---table 1 createSQL1 --table 2 createSql2 --table createSql3 ` – Rajesh Apr 06 '19 at 11:54

With some pattern matching, e.g. as described here: How to pattern match using regular expression in Scala?, the code for this could look like the below, assuming your initial expression is passed in as a sequence of lines (note that JSONObject as used below is deprecated, so you may want to replace it with an alternative).

import scala.collection.immutable
import scala.util.parsing.json.{JSONArray, JSONObject}

object Parser {

  // String interpolator so a pattern like r"create\s+..." can be used in a match.
  implicit class Regex(sc: StringContext) {
    def r = new util.matching.Regex(sc.parts.mkString, sc.parts.tail.map(_ => "x"): _*)
  }

  // Render the table name and (column, type) pairs as the target JSON structure.
  def toJson(tablename: String, columns: Seq[(String, String)]): String = {
    val columnList: List[JSONObject] =
      columns.toStream.map(x => JSONObject(Map("columnname" -> x._1, "datatype" -> x._2))).toList
    JSONArray(List(JSONObject(Map("tableName" -> tablename, "columns" -> JSONArray(columnList))))).toString()
  }

  // Extract the table name and the column definitions from the DDL lines.
  def parse(lines: Seq[String]): (String, Seq[(String, String)]) = {
    lines.mkString("").toLowerCase match {
      case r"create\s+table\s+(\S+)${tablename}\s+\((.+)${columns}\).*" =>
        val columnWithType: immutable.Seq[(String, String)] = columns.split(",").toStream
          .map(x => x.trim.split("\\s+"))
          .map(x => (x.head.toLowerCase, x(1).toLowerCase))
        (tablename, columnWithType)
      case _ => ("", Seq.empty)
    }
  }
}

To test that with your test string:

val data: (String, Seq[(String, String)]) = Parser.parse(Seq("CREATE TABLE TEMP (", "ID INT,", "NAME STRING)"))
println(Parser.toJson(data._1, data._2))
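
Given how the deprecated scala.util.parsing.json classes render their output, this should print something like:

    [{"tableName" : "temp", "columns" : [{"columnname" : "id", "datatype" : "int"}, {"columnname" : "name", "datatype" : "string"}]}]
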
awagen

With the scala.util.parsing.combinator package, you could define your own lexer and grammar parser for the DDL. For example, here is a JSON grammar:

import scala.util.parsing.combinator._

class JSON extends JavaTokenParsers {
  def value: Parser[Any] = obj | arr | stringLiteral | floatingPointNumber | "null" | "true" | "false"
  def obj: Parser[Any] = "{"~repsep(member, ",")~"}"
  def arr: Parser[Any] = "["~repsep(value, ",")~"]"
  def member: Parser[Any] = stringLiteral~":"~value
} 

The code above parses a JSON string into a token stream for further processing. Read the docs and you will be able to define your own SQL DDL parser the same way.
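
As an illustration (not from the original answer), a minimal CREATE TABLE grammar in the same style might look like the sketch below; the production names ddl and column are invented for this sketch, and it assumes the scala-parser-combinators module is on the classpath:

import scala.util.parsing.combinator._

class DDL extends JavaTokenParsers {
  // CREATE TABLE <name> ( <col> <type> [, <col> <type>]* )
  def ddl: Parser[(String, List[(String, String)])] =
    "CREATE" ~> "TABLE" ~> ident ~ ("(" ~> repsep(column, ",") <~ ")") ^^ {
      case table ~ cols => (table.toLowerCase, cols)
    }
  def column: Parser[(String, String)] =
    ident ~ ident ^^ { case name ~ tpe => (name.toLowerCase, tpe.toLowerCase) }
}

object DDLTest extends DDL {
  def main(args: Array[String]): Unit = {
    // prints something like: [1.40] parsed: (temp,List((id,int), (name,string)))
    println(parseAll(ddl, "CREATE TABLE TEMP (ID INT, NAME STRING)"))
  }
}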

Ruoyu Dai

We just published this package at https://github.com/deepstartup/jsonutils. Maybe you will find it useful. If you need us to update something, open a JIRA.

Try:

    pip install DDLJ
    from DDLj import genddl
    genddl(*param1,param2,*param3,*param4)

Where:

    param1 = JSON schema file
    param2 = Database (default: Oracle)
    param3 = Glossary file
    param4 = DDL output script