
Is there any service in EMR, or any other way, to see a progress bar (or elapsed time) when I submit a job that writes Parquet files to S3?

The code:

df.write.partitionBy("date").mode("append").parquet("s3n://uk-adp-vault/semasio/output")
Prasad Khode
ultraInstinct
  • In my experience, you should avoid appending new data this way: the runtime is roughly linear in the amount of existing data on S3 (see http://stackoverflow.com/questions/40830152/how-to-avoid-reading-old-files-from-s3-when-appending-new-data ). When using s3-dist-cp, I can see the progress in the ResourceManager (http://:8088/cluster) – Niros Feb 12 '17 at 07:23
  • What Niros suggests is correct; nevertheless, the job progress is shown in the Spark UI – eliasah Feb 12 '17 at 18:25

1 Answer


You can open the YARN ResourceManager web UI on port 8088 of the EMR master node. It lists the running applications and shows the cluster's memory utilization.

From there you can follow the ApplicationMaster link for your application, which opens the Spark UI for that application. It shows the progress of each job, with details for every stage and task.
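
If you would rather poll for the same information than watch the web pages, here is a rough sketch in Python. It assumes the ResourceManager REST API (`/ws/v1/cluster/apps`) and Spark's monitoring REST API (`/api/v1/applications/<app-id>/jobs`) are reachable through the port 8088 proxy; the function names and the `rm_base` argument are made up for illustration, and in cluster deploy mode the Spark path may also need an attempt ID, so treat it as a starting point rather than a tested recipe for EMR.

```python
import json
from urllib.request import urlopen


def fetch_json(url, timeout=10):
    """GET a URL and decode the JSON body."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def running_apps(rm_base):
    """Return the running YARN applications reported by the ResourceManager."""
    payload = fetch_json(f"{rm_base}/ws/v1/cluster/apps?states=RUNNING")
    # "apps" is null in the response when nothing is running.
    return (payload.get("apps") or {}).get("app", [])


def job_progress(jobs):
    """Turn Spark job records into (job id, status, percent of tasks done)."""
    rows = []
    for job in jobs:
        total = job.get("numTasks", 0)
        done = job.get("numCompletedTasks", 0)
        pct = 100.0 * done / total if total else 0.0
        rows.append((job["jobId"], job["status"], round(pct, 1)))
    return rows


def format_elapsed(ms):
    """Format a duration in milliseconds as HH:MM:SS."""
    hours, rem = divmod(ms // 1000, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def report(rm_base):
    """Print elapsed time and per-job task progress for each running app."""
    for app in running_apps(rm_base):
        app_id = app["id"]
        print(app_id, app["name"], "elapsed", format_elapsed(app["elapsedTime"]))
        # Assumption: the app is a Spark app whose UI is served via the YARN proxy.
        jobs_url = f"{rm_base}/proxy/{app_id}/api/v1/applications/{app_id}/jobs"
        for job_id, status, pct in job_progress(fetch_json(jobs_url)):
            print(f"  job {job_id}: {status} {pct}% of tasks completed")
```

You would call it as `report("http://<master-node-dns>:8088")` from a machine that can reach the master node. The percentage is computed from task counts per Spark job, so a single `df.write` can show up as more than one job.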

Chirag