
A colleague mentioned that Spark's DataFrameWriter class writes to a temporary location in S3 and then copies the output to the desired S3 location once the write completes. I wanted to understand this behavior better but cannot locate the source code that implements it. I've been looking here:

https://github.com/apache/spark/blob/c0632cec04e5b0f3fb3c3f27c21a2d3f3fbb4f7e/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala

Is the behavior described in this post what Spark does when writing to S3?

If so, a link to the location(s) where this code path lives, plus an explanation of why this behavior is the default (and hidden from the end user), would help me understand it.

Lucas Roberts
  • It is what happens [with basic commit algorithm](https://stackoverflow.com/q/46882683/10465355) but that's [not the only approach available out there](https://issues.apache.org/jira/browse/HADOOP-13786). – 10465355 Jan 23 '19 at 23:27
  • @user10465355 thank you for the pointers, this is very informative. I don't see how the call path leads from DataFrameWriter.scala to the basic commit algorithm you've linked in your comment. If that path is clear to you, a description of it together with your comment (perhaps with a bit of explanation) would constitute an answer. – Lucas Roberts Jan 25 '19 at 02:49
  • Take a look here - https://github.com/apache/spark/blob/cd0a08361e2526519e7c131c42116bf56fa62c76/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L121-L131 – 10465355 Jan 25 '19 at 10:30
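
To make the "write to a temporary location, then move" idea from the comments concrete, here is a minimal sketch of that commit pattern on a local filesystem. It is not Spark's or Hadoop's code: the functions `write_task`, `commit_job`, and `demo` are hypothetical names made up for illustration, and the real committer also has a separate task-commit step that is omitted here. The `_temporary` staging directory and `_SUCCESS` marker names mirror what the Hadoop file output committer uses.

```python
import os
import shutil
import tempfile


def write_task(output_dir, attempt_id, part_name, rows):
    """Write one task's output under a staging directory, not the final location."""
    staging = os.path.join(output_dir, "_temporary", attempt_id)
    os.makedirs(staging, exist_ok=True)
    path = os.path.join(staging, part_name)
    with open(path, "w") as f:
        f.write("\n".join(rows))
    return path


def commit_job(output_dir):
    """Move every staged file into the final directory, then drop the staging tree."""
    temp_root = os.path.join(output_dir, "_temporary")
    for attempt_id in sorted(os.listdir(temp_root)):
        attempt_dir = os.path.join(temp_root, attempt_id)
        for name in sorted(os.listdir(attempt_dir)):
            # On a real filesystem this rename is a cheap metadata operation.
            os.rename(os.path.join(attempt_dir, name), os.path.join(output_dir, name))
    shutil.rmtree(temp_root)
    # Empty marker file signalling that the whole job finished.
    open(os.path.join(output_dir, "_SUCCESS"), "w").close()


def demo():
    """Run two simulated tasks and return the directory listing before and after commit."""
    out = tempfile.mkdtemp()
    write_task(out, "attempt_0", "part-00000", ["a", "b"])
    write_task(out, "attempt_1", "part-00001", ["c"])
    before = sorted(os.listdir(out))
    commit_job(out)
    after = sorted(os.listdir(out))
    shutil.rmtree(out)
    return before, after
```

Calling `demo()` shows that readers of the output directory only see part files after the job commits. On HDFS the move is a rename; S3 has no native rename, so the S3 connector performs it as a copy followed by a delete, which is likely why the behavior looks like "write to a temporary location, then copy".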

0 Answers