I am very new to Spark and the whole ecosystem, so the error may be obvious to the initiated, but I have found no support or similar issues posted.
I have a lot of data (TBs) on an S3 bucket; the data is split into thousands of ~100 MB Parquet files in sub-directories. Currently, I just want to query a single file and select some rows. I am running Spark (3.0) locally with PySpark while I learn:
Code looks like this:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("Test") \
    .getOrCreate()

path = "s3a://BUCKET_NAME/DIR/FILE.gz.parquet"
df = spark.read.parquet(path)

df.printSchema()                    # this works
df.show(n=10)                       # this works
df.orderBy("sessionID").show(n=5)   # this works
df.select("sessionID").show(n=5)    # this fails
orderBy works correctly and quickly shows the first 5 rows sorted by session ID. However, the select query fails with:
19/09/13 01:16:28 ERROR TaskContextImpl: Error in TaskCompletionListener
org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 74915373; received: 45265606
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
So I believe the select operation is not receiving the full data from the S3 bucket. What can I do about this, and why does orderBy work?
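In case the connection settings matter: the only S3A tuning I am aware of is along these lines, passed through the session builder. This is just a sketch of what I could try, and I am not sure these properties are even relevant to the error:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("Test") \
    .config("spark.hadoop.fs.s3a.connection.maximum", "100") \
    .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random") \
    .getOrCreate()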
The following question is a little open-ended. The data is organized into sessions that need to be processed in one go, but the rows for each session are scattered throughout each Parquet file and across hundreds of Parquet files, so hundreds of GB would have to be traversed to piece together a complete session. I therefore want Spark to sort by session ID. The processing is to be done by a separate C++ library, so I will have to pipe the session data out. Processing the whole dataset on a local machine will be intractable, but I can apply some selects to reduce the data down to something like 50 GB, which I hope can be churned through in a few hours on a powerful workstation (32 cores, 64 GB). Is this feasible? How would the design look? A rough sketch of what I have in mind is below. Sorry this is vague, but the Spark examples are either incredibly simple on a tiny JSON or assume great in-depth knowledge, and it is hard to transition from the former to the latter.
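For concreteness, this is roughly the shape I was imagining based on what I have read so far. The column names, the filter, and the process_sessions executable are all placeholders, and I have no idea whether repartition + sortWithinPartitions + pipe is the right way to hand contiguous sessions to a C++ program:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[32]") \
    .appName("Sessions") \
    .getOrCreate()

# Read the whole dataset (all sub-dirs) and cut it down first.
df = spark.read.parquet("s3a://BUCKET_NAME/DIR/")

reduced = (df
    .select("sessionID", "timestamp", "payload")   # hypothetical column names
    .filter("payload IS NOT NULL"))                # hypothetical filter

# Put each session's rows in the same partition and sort within it,
# so a whole session arrives contiguously at the external process.
by_session = (reduced
    .repartition("sessionID")
    .sortWithinPartitions("sessionID"))

# Pipe each partition through the external C++ executable (placeholder path);
# RDD.pipe writes one line of text per row to the process's stdin.
(by_session.rdd
    .map(lambda row: ",".join(str(v) for v in row))   # naive serialization, just a guess
    .pipe("/path/to/process_sessions")
    .saveAsTextFile("/path/to/output"))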