FSDataInputStream and FSDataOutputStream are classes in hadoop-common.
FSDataInputStream adds high-performance APIs for reading data at specific offsets into byte arrays (PositionedReadable) or ByteBuffers. These are used extensively in libraries that read files non-sequentially, i.e. random IO: Parquet, ORC, etc. FileSystem implementations often provide highly efficient versions of these calls. The APIs are not relevant for a simple file copy unless you really are chasing maximum performance, have opened the same source file on multiple streams, and are fetching blocks in parallel across them. DistCp does things like this, which is why it can overload networks if you try hard.
Full specification, including PositionedReadable: fsdatainputstream
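To make that concrete, here is a minimal sketch of a positioned read against an FSDataInputStream; the path, offset and read size are all made up for illustration:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object RandomReadSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val path = new Path("hdfs:///data/example.parquet") // hypothetical file
    val fs = path.getFileSystem(conf)
    val in = fs.open(path)                              // returns FSDataInputStream
    try {
      val buffer = new Array[Byte](4096)
      // PositionedReadable.readFully: read 4 KB starting at offset 1 MB
      // without moving the stream's current position, which is what lets
      // several threads issue positioned reads against the same stream.
      in.readFully(1024L * 1024, buffer, 0, buffer.length)
    } finally {
      in.close()
    }
  }
}
```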
FSDataOutputStream doesn't add much to the normal DataOutputStream; the most important interface is Syncable, whose hflush and hsync calls come with specific durability guarantees, as in "when they return, the data has been persisted to HDFS or the other filesystem; for hsync, all the way to disk". If you are implementing a database like HBase, you need those calls and their guarantees. If you aren't, you really don't. In recent Hadoop releases, trying to use them when writing to S3 will simply log a warning telling you to stop it. It's not a real filesystem, after all.
Full specification, including Syncable: outputstream
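A minimal sketch of those durability calls, assuming an HDFS destination (the path is invented):

```scala
import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object DurableWriteSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val path = new Path("hdfs:///wal/edits.log") // hypothetical write-ahead log
    val fs = path.getFileSystem(conf)
    val out = fs.create(path, true)              // FSDataOutputStream, overwrite = true
    try {
      out.write("commit record\n".getBytes(StandardCharsets.UTF_8))
      out.hflush() // data is now visible to new readers of the file
      out.hsync()  // data has been persisted to disk on the datanodes
    } finally {
      out.close()
    }
  }
}
```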
Copying files in Spark at scale
If you want to copy files efficiently in Spark, open the source and destination files with a buffer of a few MB, read into the buffer, then write it back (see the sketch after this list). You can distribute this work across the cluster for better parallelism, as this example does: https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/com/cloudera/spark/cloud/applications/CloudCp.scala
- If you only want to copy one or two files, just do it in a single process, maybe multithreaded.
- If you are really seeking performance against S3, take the list of files to copy, schedule the largest few files first so they don't hold you up at the end, then randomize the rest of the list to avoid creating hotspots in the S3 bucket.
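As promised above, a minimal sketch of the buffered per-file copy plus the "largest first, then shuffle" scheduling of the file list. The 8 MB buffer and the headCount of 4 are arbitrary assumptions; IOUtils.copyBytes just runs the read/write loop:

```scala
import scala.util.Random
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.IOUtils

object CopySketch {
  /** Copy one file with a multi-MB buffer; this is the per-file work you
   *  would hand out to Spark tasks for parallelism. */
  def copyFile(conf: Configuration, src: Path, dest: Path,
               bufferSize: Int = 8 * 1024 * 1024): Unit = {
    val in = src.getFileSystem(conf).open(src)
    val out = dest.getFileSystem(conf).create(dest, true)
    try {
      IOUtils.copyBytes(in, out, bufferSize) // read/write loop with that buffer
    } finally {
      out.close()
      in.close()
    }
  }

  /** Largest few files first so they don't hold you up at the end,
   *  then shuffle the rest to avoid hotspotting one part of the bucket. */
  def schedule(filesWithSizes: Seq[(Path, Long)], headCount: Int = 4): Seq[Path] = {
    val bySize = filesWithSizes.sortBy(-_._2).map(_._1)
    val (largest, rest) = bySize.splitAt(headCount)
    largest ++ Random.shuffle(rest)
  }
}
```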