
I am very new to Spark and the entire ecosystem, so the error may seem trivial to the initiated, but I have found no support or similar issues posted.

I have a lot of data (TBs) on an S3 bucket; the data is split into thousands of ~100 MB parquet files in sub-directories. Currently, I just want to query a single file and select some rows. I am running Spark (3.0) locally with PySpark while I learn.

Code looks like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("Test") \
    .getOrCreate()

path = "s3a://BUCKET_NAME/DIR/FILE.gz.parquet"
df = spark.read.parquet(path)

df.printSchema()   # this works
df.show(n=10)      # this works 
df.orderBy("sessionID").show(n=5)   # this works
df.select("sessionID").show(n=5)   # this fails

The orderBy works correctly and quickly shows the first 5 rows sorted by session ID. However, the select query fails with:

19/09/13 01:16:28 ERROR TaskContextImpl: Error in TaskCompletionListener
org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 74915373; received: 45265606
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)

So I believe the select operation is not receiving the full data from the S3 bucket, but what can I do about this? And why does OrderBy work?
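As a diagnostic (not a fix), something like the following should force the sessionID column to be read in full, without show()'s row limit, so I would expect it to hit the same truncation if the S3 transfer really is being cut short:

from pyspark.sql import functions as F

# Diagnostic only: both of these read every value of sessionID (no row
# limit), so they should fail the same way if the S3 read is truncated.
df.select(F.countDistinct("sessionID")).show()
df.select("sessionID").where(F.col("sessionID").isNotNull()).count()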

The following question is a little open-ended. The data is organized into sessions that need to be processed in one go, but the rows for each session are scattered within each parquet file and across hundreds of parquet files, meaning hundreds of GB would have to be traversed to piece together a complete session. So I want Spark to sort by session ID. The processing is done by a separate C++ library, so I will have to pipe the session data out.

Processing the whole dataset on a local machine will be intractable, but I can apply some selects to reduce the data down to something like 50 GB, which I hope can be churned through in a few hours on a powerful workstation (32 cores, 64 GB). Is this feasible? How would the design look? Sorry this is vague, but the Spark examples are either incredibly simple on a tiny JSON file or assume great in-depth knowledge, and it is hard to transition from the former to the latter.
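For concreteness, this is a rough sketch of what I imagine; the column names other than sessionID, the filter, and the output path are placeholders:

from pyspark.sql import functions as F

# Rough sketch only: read everything, cut it down, then co-locate each
# session so the C++ tool can stream sessions one after another.
df = spark.read.parquet("s3a://BUCKET_NAME/DIR/")            # all sub-dirs

slim = (df
        .select("sessionID", "timestamp", "payload")         # placeholder columns
        .filter(F.col("timestamp") >= "2019-09-01"))         # placeholder filter

prepped = (slim
           .repartition(200, "sessionID")                    # all rows of a session in one partition
           .sortWithinPartitions("sessionID", "timestamp"))  # and contiguous within each output file

prepped.write.mode("overwrite").parquet("/local/sessions")   # placeholder output path

Would repartitioning by sessionID like this be the right way to get complete sessions into single files, or is there a better pattern?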

user1978816

1 Answer


After spending hours going through different config options and getting basically nowhere, I started fresh. It turns out the sysadmin had installed the latest Spark 3.0, but there must be issues with it.

I installed Spark 2.4.4 and ensured Java 8 was selected (see Pyspark error - Unsupported class file major version 55).

And everything works as expected.
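For reference, this is roughly what the working local setup looks like; the Java path is just an example, and the hadoop-aws / aws-java-sdk versions should match whatever Hadoop your Spark build ships with (Spark 2.4.4 pre-built for Hadoop 2.7 pairs with hadoop-aws 2.7.x and aws-java-sdk 1.7.4):

import os
from pyspark.sql import SparkSession

# Make sure the gateway launches on Java 8 (example path, adjust for your machine).
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

spark = (SparkSession.builder
         .master("local")
         .appName("Test")
         # Pull an S3A connector that matches the Hadoop 2.7 build.
         .config("spark.jars.packages",
                 "org.apache.hadoop:hadoop-aws:2.7.3,com.amazonaws:aws-java-sdk:1.7.4")
         .getOrCreate())

df = spark.read.parquet("s3a://BUCKET_NAME/DIR/FILE.gz.parquet")
df.select("sessionID").show(n=5)   # works on 2.4.4 + Java 8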

user1978816
I am running Spark 2.4.7, built for Hadoop 2.7.3, with hadoop-aws 2.7.3 and aws-java-sdk 1.7.4, but am still facing this error. Any idea what could be wrong? – Avik Aggarwal May 14 '21 at 18:07