I am trying to infer the schema when I load a CSV file into my SQLContext using SparkSession. Please note that I do not want to use a class here, because I am trying to infer the schema of the data file as soon as it is loaded; I have no information about the data types or column names of the file before loading it.
Here is what I am trying out in Scala:
package example

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import java.io.File
import org.apache.spark.sql.SparkSession
//import sqlContext.implicits._

object SimpleScalaSpark {
  def main(args: Array[String]): Unit = {
    //val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
    val spark = SparkSession
      .builder()
      .master("local[*]")
      .appName("Spark Hive Example")
      .config("spark.sql.warehouse.dir", "local")
      .getOrCreate()

    //val etl1Rdd = spark.sparkContext.wholeTextFiles("etl1.json").map(x => x._2)
    val jsonTbl = spark.read
      .option("header", true)
      .option("inferSchema", true)
      .option("dateFormat", "MM/dd/yyyy HH:mm")
      .csv("s1.csv")

    // print the inferred schema
    jsonTbl.printSchema()
  }
}
I am able to get DateTime, Integer, Double, and String as data types for my file. But I want to implement custom data types based on my own regex patterns, for fields such as SSN, VIN-ID, PhoneNumber, etc., which all have a fixed pattern that can be detected with a regex. This would make the schema extraction process more accurate and precise for me. For example, suppose I have a column whose values consist of 5 or more letters and 2 or more digits; I could then say that this column is of type ID.
Any ideas on whether it is possible to do this using Scala/Spark? If possible, please also share the implementation details or point me to relevant technical documentation.
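To make it concrete, here is a rough, minimal sketch of the kind of post-processing I have in mind. Everything in it (the object name RegexSchemaSketch, the helper guessType, and the regex patterns themselves) is just an illustrative placeholder, not a real implementation; the ID pattern assumes the letters come before the digits. It reads every column as a string, samples some values, and tags a column with a custom type name when all sampled values match one of the patterns, instead of building a real Spark schema:

package example

import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.util.matching.Regex

object RegexSchemaSketch {

  // Illustrative patterns only; the real SSN/phone/ID formats would need refinement.
  val customPatterns: Seq[(String, Regex)] = Seq(
    "SSN"         -> """\d{3}-\d{2}-\d{4}""".r,
    "PhoneNumber" -> """\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}""".r,
    "ID"          -> """[A-Za-z]{5,}\d{2,}""".r   // 5+ letters followed by 2+ digits
  )

  // Guess a custom type for one column: every sampled, non-empty value must match a pattern.
  def guessType(values: Seq[String]): String = {
    val nonNull = values.filter(v => v != null && v.nonEmpty)
    if (nonNull.isEmpty) "String"
    else customPatterns
      .find { case (_, re) => nonNull.forall(v => re.pattern.matcher(v).matches()) }
      .map(_._1)
      .getOrElse("String")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("Regex Schema Sketch")
      .getOrCreate()

    // Read everything as strings first, then post-process the columns.
    val df: DataFrame = spark.read
      .option("header", true)
      .csv("s1.csv")   // same input file as above

    // Sample a few rows per column and guess a custom type name for each column.
    val sample = df.limit(100).collect()
    val guessed = df.columns.zipWithIndex.map { case (col, i) =>
      col -> guessType(sample.map(_.getString(i)).toSeq)
    }

    guessed.foreach { case (col, tpe) => println(s"$col -> $tpe") }

    spark.stop()
  }
}

Is something along these lines the right direction, or is there a hook in Spark's CSV schema inference itself that I should be using instead?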