I have a CSV file with 10 columns: half are Strings and half are Integers.

What is the Scala code to:

  • Create (infer) the schema
  • Save that schema to a file

I have this so far:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .option("inferSchema", "true") // Automatically infer data types
    .load("cars.csv")

And what is the best file format for saving that schema? Is it JSON?

The goal is to create the schema only once and load it from a file next time, instead of re-creating it on the fly.

Thanks.

Bhargav Rao
Joe

1 Answer

The DataType API provides all the required utilities, so JSON is a natural choice:

import org.apache.spark.sql.types._
import scala.util.Try

val df = Seq((1L, "foo", 3.0)).toDF("id", "x1", "x2")
val serializedSchema: String = df.schema.json


def loadSchema(s: String): Option[StructType] =
  Try(DataType.fromJson(s)).toOption.flatMap {
    case s: StructType => Some(s)  // keep only top-level struct schemas
    case _ => None                 // any other DataType is not a valid schema
  }

loadSchema(serializedSchema)

Depending on your requirements you can use standard Scala methods to write this to a file, or hack it with a Spark RDD:

val schemaPath: String = ???

sc.parallelize(Seq(serializedSchema), 1).saveAsTextFile(schemaPath)
val loadedSchema: Option[StructType] = sc.textFile(schemaPath)
  .map(loadSchema)  // Load
  .collect.headOption.flatten  // Make sure we don't fail if there is no data
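The "standard Scala methods" route mentioned above can be sketched with plain `java.nio` (a sketch, assuming a local filesystem path; `writeSchemaFile` and `readSchemaFile` are hypothetical helper names, not part of the answer):

```scala
import java.nio.file.{Files, Paths}
import java.nio.charset.StandardCharsets

// Write the serialized schema string (df.schema.json) to a plain text file:
def writeSchemaFile(path: String, json: String): Unit =
  Files.write(Paths.get(path), json.getBytes(StandardCharsets.UTF_8))

// Read it back for deserialization with loadSchema:
def readSchemaFile(path: String): String =
  new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8)
```

Usage would then be `writeSchemaFile(schemaPath, serializedSchema)` to save, and `loadSchema(readSchemaFile(schemaPath))` to restore, with no RDD round-trip needed.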

For a Python equivalent see Config file to define JSON Schema Struture in PySpark.

zero323
  • I guess the Scala API docs are [missing](http://spark.apache.org/docs/2.4.5/api/scala/index.html#org.apache.spark.sql.types.DataType) for `DataType.fromJson()`. I assume it's a method intended for public use given that PySpark has an equivalent method and it's [publicly documented](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.types.StructType.fromJson) for `StructType` and others. – Nick Chammas Mar 14 '20 at 21:41
  • The loadedSchema object is unusable for me. I want to read a DataFrame and apply that schema and I get the following error: `overloaded method value schema with alternatives: (schemaString: String)org.apache.spark.sql.streaming.DataStreamReader (schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.streaming.DataStreamReader cannot be applied to (Option[org.apache.spark.sql.types.StructType])` – BuahahaXD Nov 12 '20 at 10:06
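A likely fix for the error in the comment above (a sketch, not part of the original answer): `DataStreamReader.schema` expects a `StructType`, not an `Option[StructType]`, so the `Option` returned by `loadSchema` must be unwrapped first. `requireSchema` here is a hypothetical helper name:

```scala
// Unwrap an Option, failing loudly with the source location if the
// schema could not be loaded or was not a StructType:
def requireSchema[T](loaded: Option[T], where: String): T =
  loaded.getOrElse(sys.error(s"no valid schema found at $where"))

// Usage with the answer's values (assuming a SparkSession named spark):
//   spark.readStream.schema(requireSchema(loadedSchema, schemaPath)).json(path)
```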