I am very new to Spark and the whole ecosystem, so the error may be obvious to the initiated, but I have found no support or similar issues posted.
I have a lot of data (TBs) on an S3 bucket; the data is split into thousands of ~100 MB Parquet files in sub-directories. Currently, I just want to query a single file and select some rows. I am running Spark (3.0) locally with PySpark while I learn:
Code looks like this:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("Test") \
    .getOrCreate()

path = "s3a://BUCKET_NAME/DIR/FILE.gz.parquet"
df = spark.read.parquet(path)

df.printSchema()                    # this works
df.show(n=10)                       # this works
df.orderBy("sessionID").show(n=5)   # this works
df.select("sessionID").show(n=5)    # this fails
orderBy works correctly and quickly shows the first 5 rows sorted by session ID. However, the select query fails with:
19/09/13 01:16:28 ERROR TaskContextImpl: Error in TaskCompletionListener
org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 74915373; received: 45265606
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
So I believe the select operation is not receiving the full data from the S3 bucket. What can I do about this, and why does orderBy work?
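In case the connection settings matter: the only S3A tuning I am aware of is along these lines, passed through the session builder. This is just a sketch of what I could try, and I am not sure these properties are even relevant to the error:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("Test") \
    .config("spark.hadoop.fs.s3a.connection.maximum", "100") \
    .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random") \
    .getOrCreate()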
The following question is a little open-ended. The data is organized into sessions that need to be processed in one go, but the rows for each session are scattered throughout each Parquet file and across hundreds of Parquet files, so hundreds of GB would have to be traversed to piece together a complete session. I therefore want Spark to sort by session ID. The processing is to be done by a separate C++ library, so I will have to pipe the session data out. Processing the whole dataset on a local machine will be intractable, but I can apply some selects to reduce the data down to something like 50 GB, which I hope can be churned through in a few hours on a powerful workstation (32 cores, 64 GB). Is this feasible? How would the design look? A rough sketch of what I have in mind is below. Sorry this is vague, but the Spark examples are either incredibly simple on a tiny JSON or assume great in-depth knowledge, and it is hard to transition from the former to the latter.
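For concreteness, this is roughly the shape I was imagining based on what I have read so far. The column names, the filter, and the process_sessions executable are all placeholders, and I have no idea whether repartition + sortWithinPartitions + pipe is the right way to hand contiguous sessions to a C++ program:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[32]") \
    .appName("Sessions") \
    .getOrCreate()

# Read the whole dataset (all sub-dirs) and cut it down first.
df = spark.read.parquet("s3a://BUCKET_NAME/DIR/")

reduced = (df
    .select("sessionID", "timestamp", "payload")   # hypothetical column names
    .filter("payload IS NOT NULL"))                # hypothetical filter

# Put each session's rows in the same partition and sort within it,
# so a whole session arrives contiguously at the external process.
by_session = (reduced
    .repartition("sessionID")
    .sortWithinPartitions("sessionID"))

# Pipe each partition through the external C++ executable (placeholder path);
# RDD.pipe writes one line of text per row to the process's stdin.
(by_session.rdd
    .map(lambda row: ",".join(str(v) for v in row))   # naive serialization, just a guess
    .pipe("/path/to/process_sessions")
    .saveAsTextFile("/path/to/output"))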