
I am loading CSV data into a Spark DataFrame with the inferSchema option set to true, although the schema of my CSV file is always the same and I know the exact schema.

Is it a good idea to provide the schema manually instead of inferring it? Does explicitly providing the schema improve performance?

vatsal mevada

1 Answer


Yes, it's a good idea. Schema inference causes the file to be read twice: once to infer the schema, and a second time to read it into the Dataset.

From the Spark code for DataFrameReader (a similar note appears in DataStreamReader):

This function will go through the input once to determine the input schema if inferSchema is enabled. To avoid going through the entire data once, disable inferSchema option or specify the schema explicitly using schema.

Link to code
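To make that extra pass concrete, here is a minimal Scala sketch of an inference-based read; the app name and file path are hypothetical placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-infer").getOrCreate()

// With inferSchema enabled, Spark first scans the file to determine
// column types, then reads it again into the DataFrame.
val inferred = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/path/to/data.csv") // hypothetical path
```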

However, it may be difficult to maintain schemas for, say, 100 Datasets with 200 columns each, so you should also keep maintainability in mind; the typical answer is: it depends :) For not-so-big schemas, or for large files where inference would be expensive, I recommend defining a custom schema in code, as in the sketch below.
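A minimal sketch of the explicit-schema approach; the column names and types are hypothetical, so substitute your file's actual layout:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("csv-explicit-schema").getOrCreate()

// Hypothetical columns for illustration only.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true),
  StructField("amount", DoubleType, nullable = true)
))

// With an explicit schema there is no inference pass; the first job
// Spark runs is your own action, e.g. count().
val df = spark.read
  .option("header", "true")
  .schema(schema)
  .csv("/path/to/data.csv") // hypothetical path

df.count()
```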

T. Gawęda
  • Can you please cite some source backing your first statement? – vatsal mevada Aug 09 '17 at 09:02
  • @vatsalmevada No, but you can simply observe it: with inferSchema there are three additional jobs when calling csv(). With a custom schema, the first job will be your own, for example count() – T. Gawęda Aug 09 '17 at 09:15
  • [inferSchema automatically infers column types. It requires one extra pass over the data](https://github.com/databricks/spark-csv#features) – Fabich Aug 09 '17 at 09:15
  • Thanks :) @vatsalmevada I have also added the comment from the Spark code – T. Gawęda Aug 09 '17 at 09:20