I'm running a PySpark application in:
- emr-5.8.0
- Hadoop distribution: Amazon 2.7.3
- Spark 2.2.0
I'm running on a very large cluster. The application reads a few input files from S3. One of these is loaded into memory and broadcast to all the nodes. The other is distributed to the disks of each node in the cluster using the SparkFiles functionality. The application works, but performance is slower than expected for larger jobs. Looking at the log files, I see the following warning repeated almost constantly:
WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
It tends to happen after a message about accessing the file that was loaded into memory and broadcast. Is this warning something to worry about? How can I avoid it?
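For reference, here is a simplified sketch of how the two inputs are loaded; the paths, names, and processing logic are placeholders for the real ones:

```python
from pyspark import SparkContext, SparkFiles

sc = SparkContext(appName="my_app")  # placeholder app name

# Input 1: small enough to hold in memory on the driver, then broadcast
small_rows = sc.textFile("s3://my-bucket/small_input.csv").collect()  # placeholder path
small_bc = sc.broadcast(small_rows)

# Input 2: shipped to the local disk of every node via SparkFiles
sc.addFile("s3://my-bucket/side_input.dat")  # placeholder path

def process(record):
    # Each task reads the side input from the node's local copy
    local_path = SparkFiles.get("side_input.dat")
    # ... use local_path and small_bc.value with the record ...
    return record

result = sc.textFile("s3://my-bucket/main_input/").map(process)  # placeholder path
```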
Google searching brings up several people dealing with this warning in native Hadoop applications, but I've found nothing about it for Spark or PySpark and can't figure out how those solutions would apply in my case.
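If those fixes boil down to Hadoop configuration properties, I assume I would pass them through from PySpark like this, but I don't know which property (if any) addresses this warning:

```python
from pyspark import SparkConf, SparkContext

# "fs.s3.some.property" is only a placeholder -- I don't know which
# Hadoop/S3 setting the native-Hadoop suggestions correspond to.
conf = SparkConf().set("spark.hadoop.fs.s3.some.property", "some-value")
sc = SparkContext(conf=conf)
```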
Thanks!