
Can a SQL DDL statement be parsed into a simple JSON schema file, as shown below, without using any external tools, just Scala/Python/shell scripting?

CREATE TABLE TEMP (
      ID INT,
      NAME STRING)

[
  {
    "tableName": "temp",
    "columns": [
      {
        "columnname": "id",
        "datatype": "int"
      },
      {
        "columnname": "name",
        "datatype": "string"
      }
    ]
  }
]
Rajesh

4 Answers


You can build a JSON-form string from your DDL using the logic below (Scala code). Once the string is built, it is converted into a DataFrame, and that DataFrame is then saved to HDFS/Amazon S3 as a JSON file using the DataFrame's built-in write.json API.

import spark.implicits._

val createSql = "CREATE TABLE TEMP (ID INT, NAME STRING)"

// Start the JSON with the table name (the third token of the DDL).
var jsonString = """[{"tableName":"""" + createSql.split(" ")(2).toLowerCase + "\",\"columns\":["

// Take the text between the parentheses, split it into column definitions,
// and append one JSON object per column.
createSql.split("\\(")(1).split("\\)")(0).split(",").foreach { r =>
  jsonString += "{\"columnname\": \"" + r.trim.split(" ")(0).toLowerCase +
    "\",\"datatype\": \"" + r.trim.split(" ")(1).toLowerCase + "\"},"
}

// Drop the trailing comma and close the array and object.
jsonString = jsonString.patch(jsonString.lastIndexOf(','), "", 1) + "]}]"

// Read the string into a DataFrame and write it out as a JSON file.
val df = spark.read.json(Seq(jsonString).toDS)
df.coalesce(1).write.json("<targetlocation>")
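
For the sample statement above, the intermediate jsonString should work out to:

    [{"tableName":"temp","columns":[{"columnname": "id","datatype": "int"},{"columnname": "name","datatype": "string"}]}]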

Please let me know if you have any questions.

Sarath Subramanian
  • It worked perfectly for me. What if I have the create SQL in a file with multiple DDL statements, and there are some comments in between that need to be skipped in the same file? ex: `---table 1 createSQL1 --table 2 createSql2 --table createSql3 ` – Rajesh Apr 06 '19 at 11:54

With some pattern matching, e.g. as described here: How to pattern match using regular expression in Scala?, the code for this could look like the below, assuming your initial expression is passed in as a sequence of lines (note that JSONObject as used below is deprecated, so you may want to replace it with an alternative).

import scala.collection.immutable
import scala.util.parsing.json.{JSONArray, JSONObject}

object Parser {

  // String interpolator so a pattern like r"create\s+..." can be used in a match.
  implicit class Regex(sc: StringContext) {
    def r = new util.matching.Regex(sc.parts.mkString, sc.parts.tail.map(_ => "x"): _*)
  }

  // Render the table name and (column, type) pairs as the target JSON structure.
  def toJson(tablename: String, columns: Seq[(String, String)]): String = {
    val columnList: List[JSONObject] =
      columns.toStream.map(x => JSONObject(Map("columnname" -> x._1, "datatype" -> x._2))).toList
    JSONArray(List(JSONObject(Map("tableName" -> tablename, "columns" -> JSONArray(columnList))))).toString()
  }

  // Extract the table name and the column definitions from the DDL lines.
  def parse(lines: Seq[String]): (String, Seq[(String, String)]) = {
    lines.mkString("").toLowerCase match {
      case r"create\s+table\s+(\S+)${tablename}\s+\((.+)${columns}\).*" =>
        val columnWithType: immutable.Seq[(String, String)] = columns.split(",").toStream
          .map(x => x.trim.split("\\s+"))
          .map(x => (x.head.toLowerCase, x(1).toLowerCase))
        (tablename, columnWithType)
      case _ => ("", Seq.empty)
    }
  }
}

To test that with your test string:

val data: (String, Seq[(String, String)]) = Parser.parse(Seq("CREATE TABLE TEMP (", "ID INT,", "NAME STRING)"))
println(Parser.toJson(data._1, data._2))
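
Given how the deprecated scala.util.parsing.json classes render their output, this should print something like:

    [{"tableName" : "temp", "columns" : [{"columnname" : "id", "datatype" : "int"}, {"columnname" : "name", "datatype" : "string"}]}]
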
awagen

With the scala.util.parsing.combinator package, you could define your own lexer and grammar parser for the DDL. For example, here is a JSON grammar:

import scala.util.parsing.combinator._

class JSON extends JavaTokenParsers {
  def value: Parser[Any] = obj | arr | stringLiteral | floatingPointNumber | "null" | "true" | "false"
  def obj: Parser[Any] = "{"~repsep(member, ",")~"}"
  def arr: Parser[Any] = "["~repsep(value, ",")~"]"
  def member: Parser[Any] = stringLiteral~":"~value
} 

The code above parses a JSON string into a token stream for further processing. Read the docs and you will be able to define your own SQL DDL parser the same way.
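
As an illustration (not from the original answer), a minimal CREATE TABLE grammar in the same style might look like the sketch below; the production names ddl and column are invented for this sketch, and it assumes the scala-parser-combinators module is on the classpath:

import scala.util.parsing.combinator._

class DDL extends JavaTokenParsers {
  // CREATE TABLE <name> ( <col> <type> [, <col> <type>]* )
  def ddl: Parser[(String, List[(String, String)])] =
    "CREATE" ~> "TABLE" ~> ident ~ ("(" ~> repsep(column, ",") <~ ")") ^^ {
      case table ~ cols => (table.toLowerCase, cols)
    }
  def column: Parser[(String, String)] =
    ident ~ ident ^^ { case name ~ tpe => (name.toLowerCase, tpe.toLowerCase) }
}

object DDLTest extends DDL {
  def main(args: Array[String]): Unit = {
    // prints something like: [1.40] parsed: (temp,List((id,int), (name,string)))
    println(parseAll(ddl, "CREATE TABLE TEMP (ID INT, NAME STRING)"))
  }
}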

Ruoyu Dai

We just published this package at https://github.com/deepstartup/jsonutils. Maybe you will find it useful. If you need us to update something, open a JIRA.

Try:

    pip install DDLJ
    from DDLj import genddl
    genddl(*param1,param2,*param3,*param4)

Where:

    param1 = JSON schema file
    param2 = Database (default: Oracle)
    param3 = Glossary file
    param4 = DDL output script