
I just started running Spark jobs using S3 as input and EC2 instances for the cluster. I'm using Cloudera, Spark 2.3.0, DataFrames, Jupyter notebook, and Python 2.

It was very strange for me to see random input size values for the job's stages and their tasks. By random I mean that the values for these metrics increase and decrease without any logic. This never happened to me when using HDFS as the input (on an in-house cluster).

I created a video with this behavior : https://youtu.be/MQJ3DU-zOvs

Code:

from pyspark.sql.functions import count

dataframe = spark.\
                read.\
                parquet("s3n://path_to_input")

dataframe.\
    groupBy("column1").\
    agg(
        count("*").alias("alias1")
    ).\
    write.\
    parquet("s3n://path_to_s3", mode="overwrite")

Have you encountered this type of issue, or do you know what the cause is? Thanks

  • Possible duplicate of [Spark throws java.io.IOException: Failed to rename when saving part-xxxxx.gz](https://stackoverflow.com/questions/51050591/spark-throws-java-io-ioexception-failed-to-rename-when-saving-part-xxxxx-gz) – stevel Jul 11 '18 at 12:12
  • thanks @SteveLoughran for the suggestion, but it doesn't seem a duplicate to me. The input for my query was saved a few days ago and there is no simultaneous work on it, so the fact that S3 is eventually consistent shouldn't be an issue. – Tudor Lapusan Jul 12 '18 at 07:08

1 Answer


If you are chaining together queries using S3 as the intermediate store, the fact that S3 is eventually consistent means that the second query may get a listing which omits recently created files (and can include recently deleted ones). The normal commit operations (which list directory trees and rename them) suffer from this from the outset.
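
A minimal sketch of that chained-query pattern, assuming the notebook's SparkSession `spark`; the bucket paths and column name here are hypothetical, not the OP's actual pipeline:

from pyspark.sql.functions import count

# Job 1: write an aggregate to an intermediate S3 location.
spark.read.parquet("s3n://bucket/input") \
    .groupBy("column1") \
    .agg(count("*").alias("alias1")) \
    .write.parquet("s3n://bucket/intermediate", mode="overwrite")

# Job 2: reading the intermediate location requires a directory listing
# to discover the part files. On an eventually consistent store that
# listing may omit files Job 1 just wrote (or include files it just
# deleted), so the input size Job 2 reports can vary between runs.
spark.read.parquet("s3n://bucket/intermediate").count()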

stevel
  • I have just one query (I have updated the code in the post). The input for the query was saved in S3 a few days ago, so I think S3's eventual consistency is not the problem here. – Tudor Lapusan Jul 12 '18 at 07:06
  • ok, if it's from more than a few minutes ago, it's not consistency. But if the data was generated by Spark itself, the job could have been inconsistent during its execution. If it came from an external source, that's not it. Download the data locally and try again (sketched below) – stevel Jul 13 '18 at 12:44
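
As a rough sketch of that last suggestion, assuming a hypothetical bucket path and local directory:

# Copy the input out of S3 first, e.g. with the AWS CLI:
#   aws s3 sync s3://bucket/path_to_input /tmp/path_to_input
# Then read it from the local filesystem, taking S3 out of the picture.
# If the metrics are stable now, the S3 connector was the variable.
dataframe = spark.read.parquet("file:///tmp/path_to_input")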