I'm running a PySpark application in:
- emr-5.8.0
- Hadoop distribution: Amazon 2.7.3
- Spark 2.2.0
I'm running on a very large cluster. The application reads a few input files from S3. One of these is loaded into memory and broadcast to all the nodes. The other is distributed to the disks of each node in the cluster using the SparkFiles functionality. The application works, but performance is slower than expected for larger jobs. Looking at the log files, I see the following warning repeated almost constantly:
WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
It tends to happen after a message about accessing the file that was loaded into memory and broadcast. Is this warning something to worry about? How can I avoid it?
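For reference, here is a simplified sketch of how the two inputs are loaded; the paths, names, and processing logic are placeholders for the real ones:

```python
from pyspark import SparkContext, SparkFiles

sc = SparkContext(appName="my_app")  # placeholder app name

# Input 1: small enough to hold in memory on the driver, then broadcast
small_rows = sc.textFile("s3://my-bucket/small_input.csv").collect()  # placeholder path
small_bc = sc.broadcast(small_rows)

# Input 2: shipped to the local disk of every node via SparkFiles
sc.addFile("s3://my-bucket/side_input.dat")  # placeholder path

def process(record):
    # Each task reads the side input from the node's local copy
    local_path = SparkFiles.get("side_input.dat")
    # ... use local_path and small_bc.value with the record ...
    return record

result = sc.textFile("s3://my-bucket/main_input/").map(process)  # placeholder path
```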
Google searching brings up several people dealing with this warning in native Hadoop applications, but I've found nothing about it for Spark or PySpark and can't figure out how those solutions would apply in my case.
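If those fixes boil down to Hadoop configuration properties, I assume I would pass them through from PySpark like this, but I don't know which property (if any) addresses this warning:

```python
from pyspark import SparkConf, SparkContext

# "fs.s3.some.property" is only a placeholder -- I don't know which
# Hadoop/S3 setting the native-Hadoop suggestions correspond to.
conf = SparkConf().set("spark.hadoop.fs.s3.some.property", "some-value")
sc = SparkContext(conf=conf)
```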
Thanks!