
Does anyone know the difference between spark.read.format("csv") and spark.read.csv?

Some say spark.read.csv is an alias of spark.read.format("csv"), but I saw a difference between the two. As an experiment, I ran each command below in a new pyspark session so that nothing was cached.

DF1 took 42 secs while DF2 took just 10 secs. The csv file is 60+ GB.

DF1 = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("hdfs://bda-ns/user/project/xxx.csv")

DF2 = spark.read.option("header", "true").csv("hdfs://bda-ns/user/project/xxx.csv")

I dug into this because I need to union two dataframes after filtering and then write the result back to HDFS, and the write took extremely long (still writing after 16 hours).

user1342124

1 Answer


Basically, the two calls are functionally the same: spark.read.csv("path") does the same thing as spark.read.format("csv").load("path"). The difference is in your two implementations.

With DF1 you added the inferSchema option, which slows the read down. That explains why DF1 took longer than DF2.

inferSchema: automatically infers column types. It requires one extra pass over the data and is false by default. See the detailed documentation.
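If you need typed columns but want to avoid the inference pass, one option is to supply the schema yourself. When inferSchema is left at false and no schema is given, the CSV reader loads every column as a string, so an explicit schema is the way to get types without the extra scan. The sketch below assumes `spark` is an existing SparkSession. The column names and types in `SCHEMA_DDL` are hypothetical, since the question does not show the file's columns, and `read_csv_with_schema` is just an illustrative helper name.

```python
# Hypothetical column layout: the real file's columns are not shown in the question.
SCHEMA_DDL = "id INT, name STRING, amount DOUBLE"


def read_csv_with_schema(spark, path, schema_ddl=SCHEMA_DDL):
    """Read a CSV with an explicit schema instead of inferSchema.

    `spark` is assumed to be an existing SparkSession.
    """
    return (
        spark.read
        .format("csv")
        .option("header", "true")
        .schema(schema_ddl)  # DataFrameReader.schema accepts a DDL-formatted string
        .load(path)
    )
```

Usage would look like `DF1 = read_csv_with_schema(spark, "hdfs://bda-ns/user/project/xxx.csv")`, with `SCHEMA_DDL` adjusted to the real columns.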

Duy Nguyen
  • Can someone please help me understand in which scenarios we should use spark.read.csv("path") versus spark.read.format("csv").load("path")? – Gopesh Jan 12 '21 at 13:15
  • @Gopesh They are equivalent, but imagine you want to load different file formats depending on some logic: ```spark.read.format("csv" if SOMETHING else "ORC").load("path")``` is more readable than an if/else around ```spark.read.csv("path")```. – Duy Nguyen May 10 '21 at 16:32
  • @DuyNguyen if inferSchema is false by default and we don't pass a schema, how does it guess the schema? – Eugenio.Gastelum96 Jul 16 '23 at 21:07
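The format-selection idea from the comment about loading different file formats can be sketched as follows. This is only an illustration: `pick_format` and `load_any` are hypothetical helper names, the extension-based rule stands in for whatever SOMETHING is in your code, and `spark` is assumed to be an existing SparkSession.

```python
def pick_format(path):
    """Choose a reader format from the file extension (hypothetical rule)."""
    return "csv" if path.lower().endswith(".csv") else "orc"


def load_any(spark, path):
    """Load a file with the format decided at runtime.

    Because the format is just a string argument, a single
    spark.read.format(...).load(...) call covers every case, with no
    if/else around spark.read.csv(...) and spark.read.orc(...).
    """
    return spark.read.format(pick_format(path)).load(path)
```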