I just started running Spark jobs using S3 as the input and EC2 instances for the cluster. I'm using Cloudera, Spark 2.3.0, DataFrames, a Jupyter notebook, and Python 2.
It was very strange for me to see random input size values for the job stages and their tasks. By random I mean that the values of these metrics keep increasing and decreasing without any apparent logic. I have never seen anything like this when using HDFS as the input (on an in-house cluster).
I created a video showing this behavior: https://youtu.be/MQJ3DU-zOvs
Code:
from pyspark.sql.functions import count

# read the input from S3
dataframe = spark.\
read.\
parquet("s3n://path_to_input")

# aggregate and write the result back to S3
dataframe.\
groupBy("column1").\
agg(
count("*").alias("alias1")
).\
write.\
parquet("s3n://path_to_s3", mode="overwrite")
Have you encountered this type of issue, or do you know what might be causing it? Thanks.