
I have a CSV file, test.csv:

col
1
2
3
4

When I read it using Spark, it infers the schema of the data correctly:

val df = spark.read.option("header", "true").option("inferSchema", "true").csv("test.csv")

df.printSchema
root
 |-- col: integer (nullable = true)

But when I override the schema of the CSV file and set inferSchema to false, the SparkSession picks up the custom schema only partially.

val df = spark.read.option("header", "true").option("inferSchema", "false").schema(StructType(List(StructField("custom", StringType, false)))).csv("test.csv")

df.printSchema
root
 |-- custom: string (nullable = true)

Only the column name (custom) and the DataType (StringType) are picked up. The nullable part is ignored; it still comes out as nullable = true, which is incorrect.

I am not able to understand this behavior. Any help is appreciated!
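
For reference, I see the same thing if I inspect the schema fields directly (Spark 2.x shell, same custom schema as above, just bound to a val for readability):

import org.apache.spark.sql.types.{StringType, StructField, StructType}

val customSchema = StructType(List(StructField("custom", StringType, false)))
val df = spark.read.option("header", "true").option("inferSchema", "false").schema(customSchema).csv("test.csv")

// prints (custom,StringType,true) even though the schema says nullable = false
df.schema.fields.foreach(f => println((f.name, f.dataType, f.nullable)))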

himanshuIIITian

2 Answers


Consider this excerpt from the documentation about Parquet (a popular "Big Data" storage format):

"Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data. When writing Parquet files, all columns are automatically converted to be nullable for compatibility reasons."

CSV is handled the same way for the same reason.

As for what "compatibility reasons" means, Nathan Marz in his book Big Data argues that an ideal storage schema is both strongly typed for integrity and flexible for evolution. In other words, it should be easy to add and remove fields without having your analytics blow up. Parquet is both typed and flexible; CSV is merely flexible. Spark honors that flexibility by making columns nullable no matter what you do. You can debate whether you like that approach.

A SQL table has a rigorously defined schema that is hard to change, so much so that Scott Ambler wrote an entire book on how to refactor them. Parquet and CSV are much less rigorous. Each is suited to the paradigm for which it was built, and Spark takes the liberal approach typically associated with "Big Data" storage formats.
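
To illustrate (a quick sketch against the Spark shell; the /tmp/strict path is just a placeholder), build a DataFrame that declares its column non-nullable, write it to Parquet, and read it back. The round trip comes back nullable = true, exactly as the documentation excerpt says:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// a schema that explicitly declares the column as NOT nullable
val strictSchema = StructType(Seq(StructField("col", IntegerType, nullable = false)))
val strictDf = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(1), Row(2), Row(3), Row(4))),
  strictSchema)

strictDf.printSchema                                      // col: integer (nullable = false)

strictDf.write.mode("overwrite").parquet("/tmp/strict")   // placeholder path
spark.read.parquet("/tmp/strict").printSchema             // col: integer (nullable = true)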

Vidya
  • I understand that due to compatibility issues the nullable property is set to `true` by default. But the flexibility to alter it should work if it is specified explicitly. – himanshuIIITian Apr 26 '17 at 05:55
  • Your posted question and your comment question ("as to why data source can support nullability?") indicated confusion as to why your explicit setting isn't taking effect and what the data source has to do with it. I answered that question. Your belief that this is a bug is objectively wrong; it is a documented, deliberate feature. Rather than declare anything you don't like a bug, consider raising your concerns on the discussion board and hearing out the opposing view or better yet, submit a pull request. – Vidya Apr 26 '17 at 12:31
  • I have already reported a bug for it on Spark Issue Tracker - https://issues.apache.org/jira/browse/SPARK-20457. But unfortunately, it has been marked as `Duplicate` :P – himanshuIIITian Apr 26 '17 at 16:48
  • Yes, because the belief this is a problem is common, but you will also notice that there is a PR to change the behavior and a long thread on GitHub about whether it is a good idea or not. My point is not to judge either way; rather, it is to point out that the behavior is as expected and that the decision to change is much more complex than you realize. – Vidya Apr 26 '17 at 17:06
  • I understand that data sources like CSV, Parquet, etc. do not support nullability, and to make Spark compatible with them, nullability is set to `true` by default. But my point is: if we want to override that behavior programmatically and Spark allows us to do that, then why such unexpected behavior? If Spark does not want us to change it, then it should not provide an API for it. – himanshuIIITian Apr 26 '17 at 17:31
  • Yes, I understand your points for. I also understand the points against. We've been over this, and repeating your point of view doesn't change anything. Again, I'm not judging either way. While I agree the API might be confusing, especially absent clear documentation on which data sources honor your wishes and which don't, Spark provides that API call because not all data sources are the same. A SQL table and a file-based storage format are fundamentally different because the relational world and the flexible schema/Big Data world are fundamentally different, yet both act as data sources. – Vidya Apr 26 '17 at 17:38

I believe the "inferSchema" option is common to, and applies to, all the columns of a DataFrame as a whole. But if we want to change the nullable property of a specific column, we could handle/set it with something like:

setNullableStateOfColumn(df, "col", false)

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{StructField, StructType}

def setNullableStateOfColumn(df: DataFrame, cn: String, nullable: Boolean): DataFrame = {
  // get the current schema
  val schema = df.schema
  // rebuild the schema, overriding the nullable flag of the StructField named `cn`
  val newSchema = StructType(schema.map {
    case sf @ StructField(c, _, _, _) if c.equals(cn) => sf.copy(nullable = nullable)
    case y: StructField => y
  })
  // apply the new schema by recreating the DataFrame from the underlying RDD
  df.sqlContext.createDataFrame(df.rdd, newSchema)
}
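
For example, applied to the DataFrame from the question (a small sketch; the column name "custom" comes from the custom schema the question defines):

// df is the DataFrame read from test.csv with the custom schema
val fixedDf = setNullableStateOfColumn(df, "custom", nullable = false)

fixedDf.printSchema
// root
//  |-- custom: string (nullable = false)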

There is a similar thread about setting the nullable property of a column:

Change nullable property of column in spark dataframe

Kris
  • Thanks for the response! This solution worked for me, and now I am able to set the nullable property according to my `schema`. Although this problem is still a bug in Spark 2.x, it looks like a great workaround. Thanks again! – himanshuIIITian Apr 26 '17 at 05:26