
I'm relatively new to Spark and I have a few questions about tuning optimizations with respect to the spark-submit command.

I have followed: How to tune spark executor number, cores and executor memory?

and I understand how to utilise the maximum resources of my Spark cluster.

However, I was recently asked how to choose the number of executors, cores, and memory when I have a relatively small job to run, since if I request the maximum resources, they will be underutilised.

For instance,

if I just have to do a merge job (read files from HDFS and write one single huge file back to HDFS using coalesce) on about 60-70 GB of data in uncompressed Avro format (assume each file is 128 MB, which is the HDFS block size), what would be the ideal memory, number of executors, and cores for this? Assume my nodes have the same configuration as those mentioned in the link above.
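
For concreteness, the job I have in mind is roughly the sketch below (the paths are placeholders, and I'm assuming the spark-avro package is on the classpath so the "avro" format resolves):

    import org.apache.spark.sql.SparkSession

    object AvroMergeJob {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("avro-merge")
          .getOrCreate()

        // Read the ~500 input files (60-70 GB at 128 MB per HDFS block).
        val df = spark.read.format("avro").load("hdfs:///data/input/")

        // coalesce(1) avoids a shuffle and produces a single output file,
        // at the cost of funnelling everything through one write task.
        df.coalesce(1)
          .write
          .format("avro")
          .save("hdfs:///data/output/merged")

        spark.stop()
      }
    }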

I can't work out how much memory the entire job will actually use, given that there are no joins, aggregations, etc.

mythic

1 Answer


How much memory you will need depends on what you run before the write operation. If all you're doing is reading data, combining it, and writing it out, you will need very little memory per CPU, because the dataset is never fully materialized before being written out. If you're doing joins, group-bys, or other aggregate operations, all of those will require much more memory. One caveat: Spark isn't really tuned for producing very large single files and is generally much more performant when dealing with sets of reasonably sized files. Ultimately, the best way to get your answer is to run your job with the default parameters and see what blows up.
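
As a rough starting point (the numbers below are illustrative, not a recommendation for your specific hardware), a modest allocation is usually enough for a pure read-coalesce-write job, and you only need to scale up if tasks start spilling or failing:

    import org.apache.spark.sql.SparkSession

    // Illustrative settings only; the equivalent spark-submit flags would be
    //   --num-executors 4 --executor-cores 4 --executor-memory 4g
    val spark = SparkSession.builder()
      .appName("avro-merge-small")
      .config("spark.executor.instances", "4") // a handful of executors covers ~500 input blocks
      .config("spark.executor.cores", "4")
      .config("spark.executor.memory", "4g")   // rows stream through, so per-task memory stays small
      .getOrCreate()

One thing to keep in mind is that coalesce(1) also collapses the read into a single task; if that final write becomes the bottleneck, writing a handful of files instead of exactly one (or using repartition(1), which keeps the read parallel at the cost of a shuffle) is worth considering.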

Andrew Long