
Does the output of a Spark job need to be written to HDFS and downloaded from there, or can it be written to the local file system directly?

Aditya

1 Answer


Fundamentally, no: on a cluster you cannot use Spark's native writing APIs (e.g. df.write.parquet) to write to local filesystem files. When running Spark in local mode (on your own computer, not a cluster), you will be reading from and writing to your local filesystem. However, in a cluster setting (standalone/YARN/etc.), writing to HDFS is the only logical approach, since partitions are [generally] spread across separate nodes.
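
For illustration, a minimal PySpark sketch of both cases; the paths, app name, and example DataFrame are hypothetical, not from the question:

```python
from pyspark.sql import SparkSession

# Local mode: Spark runs on a single machine, so a plain path resolves
# against your own local filesystem.
spark = SparkSession.builder.master("local[*]").appName("write-demo").getOrCreate()

df = spark.range(100)
df.write.mode("overwrite").parquet("/tmp/spark_output")  # hypothetical local path

# On a cluster (standalone/YARN/etc.) each executor writes its own partitions,
# so the target should be a distributed store such as HDFS:
# df.write.mode("overwrite").parquet("hdfs:///user/me/spark_output")  # hypothetical HDFS path
```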

Writing to HDFS is inherently distributed, whereas writing to the local filesystem would run into at least one of two problems:

1) writing to each node's local filesystem would leave the output scattered across nodes (5 files on one node, 7 files on another, etc.)

2) writing to the driver's filesystem would require sending all the executors' results to the driver, akin to running collect

You can, however, write to the driver's local filesystem using traditional I/O operations built into languages like Python or Scala, as sketched below.
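
A minimal sketch of that collect-and-write pattern in PySpark; the output path and example DataFrame are assumptions for illustration only:

```python
import csv
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("driver-local-write").getOrCreate()
df = spark.range(100)  # small example DataFrame

# Pull all rows back to the driver (same cost as running collect()),
# then write them with ordinary Python file I/O on the driver's local disk.
rows = df.collect()  # everything must fit in driver memory

with open("/tmp/driver_output.csv", "w", newline="") as f:  # hypothetical path
    writer = csv.writer(f)
    writer.writerow(df.columns)  # header row
    writer.writerows(rows)       # each Row behaves like a tuple
```

This only makes sense for results small enough to fit in driver memory; large outputs should stay in HDFS or another distributed store.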

Relevant Stack Overflow questions:

How to write to CSV in Spark

Save a spark RDD to the local file system using Java

Spark (Scala) Writing (and reading) to local file system from driver

Garren S