
I am trying to infer the schema when I load a CSV file using SparkSession. Note that I do not want to use a case class here: I am trying to infer the schema of the data file as soon as it is loaded, since I have no information about its column names or data types before loading it.

Here is what I am trying out in Scala:

package example

import org.apache.spark.sql.SparkSession

object SimpleScalaSpark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local[*]")
      .appName("Spark Hive Example")
      .config("spark.sql.warehouse.dir", "local")
      .getOrCreate()

    // Read the CSV and let Spark infer column names and types.
    // Note: .csv(...) already selects the CSV data source, so a separate
    // .format("org.apache.spark.csv") call is redundant (the correct
    // short name would be just "csv" anyway).
    val csvTbl = spark.read
      .option("header", true)
      .option("inferSchema", true)
      .option("dateFormat", "MM/dd/yyyy HH:mm")
      .csv("s1.csv")

    // Print the inferred schema
    csvTbl.printSchema()
  }
}

I am able to get DateTime, Integer, Double, and String as inferred data types for my file. But I want to implement custom data types based on my own regex patterns, for fields such as SSN, VIN-ID, PhoneNumber, etc., which all follow a fixed pattern that can be detected with a regex. This would make the schema-extraction process more accurate and precise for me. For example, if I have a column whose values consist of five or more letters followed by two or more digits, I can say that this column is of type ID.
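
Something along these lines is what I have in mind: after loading, sample each string column and tag it with the first regex pattern that matches every sampled value. This is only a rough sketch; the pattern names, regexes, tagColumns helper, and sample size are placeholders I made up, and the tags are plain labels, since Spark itself has no SSN or ID type:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StringType

// Illustrative patterns -- placeholder names and regexes, not real Spark types.
val patterns: Seq[(String, scala.util.matching.Regex)] = Seq(
  "SSN"         -> """^\d{3}-\d{2}-\d{4}$""".r,
  "PhoneNumber" -> """^[0-9()/\- ]{10,14}$""".r,
  "ID"          -> """^[A-Za-z]{5,}\d{2,}$""".r
)

// Sample each string column and return the first pattern name that
// matches every non-null sampled value; fall back to "String".
def tagColumns(df: DataFrame, sampleSize: Int = 100): Map[String, String] =
  df.schema.fields
    .filter(_.dataType == StringType)
    .map { field =>
      val sample = df.select(field.name).na.drop().limit(sampleSize)
        .collect().map(_.getString(0))
      val tag = patterns.collectFirst {
        case (name, re) if sample.nonEmpty &&
          sample.forall(v => re.pattern.matcher(v).matches()) => name
      }.getOrElse("String")
      field.name -> tag
    }.toMap

Calling tagColumns(csvTbl) on the DataFrame above would then give me something like Map("ssn" -> "SSN", "name" -> "String").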

Any ideas on whether this is possible using Scala/Spark? If so, please share the implementation details or point me to relevant technical documentation.

CodeHunter
  • _I can say that this column is of type ID_ - it is not clear what you mean. UDTs were made private a long time ago, and there is no ID type in Spark. Also, the inference mechanism is internal and not pluggable. If you are up to modifying the Spark source, it is not hard to find. – Alper t. Turker May 29 '18 at 22:06
  • I mean that I want to define certain column types in my own way. Say I have columns whose values contain 10 digits with characters such as (, /, or - in between; then that column can be treated as a contact number. Likewise, I need to define my own custom types. If it is possible to do so by modifying the source code, can you point me to that source code? I would happily work on that. – CodeHunter May 29 '18 at 22:13
  • You can start with these two questions - [How to define schema for custom type in Spark SQL?](https://stackoverflow.com/q/32440461/8371915) and [How to force inferSchema for CSV to consider integers as dates (with “dateFormat” option)?](https://stackoverflow.com/q/46529404/8371915) – Alper t. Turker May 29 '18 at 22:48
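
Following up on the links in that last comment: per the first linked question, one workaround is to bypass inference entirely and pass an explicit schema to the reader. A minimal sketch, assuming made-up column names and types (not taken from s1.csv):

import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// Explicit schema -- column names and types here are illustrative assumptions.
val explicitSchema = StructType(Seq(
  StructField("ssn",    StringType, nullable = true),
  StructField("phone",  StringType, nullable = true),
  StructField("amount", DoubleType, nullable = true)
))

val tbl = spark.read
  .option("header", true)
  .schema(explicitSchema) // an explicit schema replaces inferSchema
  .csv("s1.csv")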

0 Answers