I am trying to remove a manual step of adding an argument to my spark-submit by having my Java Spark application automatically calculate the number of available cores to use for partitioning. The hope is to find a way to do this programmatically.

I did look at this solution: Spark: get number of cluster cores programmatically, but am not sure how to implement the "EncapsulationViolator" component needed to make blockManager.master.getStorageStatus.length - 1 work in Java. I have also tried sc.getExecutorStorageStatus().length - 1, to no avail. I was able to get the number of cores via Runtime.getRuntime().availableProcessors(), but the number of nodes/workers/executors still eludes me.

Hoping someone has a suggestion for getting the number of executors beyond what has already been suggested. I'm on Spark 3.0 and writing in Java.
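
For reference, the per-machine core count mentioned above can be obtained in plain Java like this (no Spark dependency; note it only reports the cores visible to the JVM's own host, i.e. the driver machine, not the cluster):

```java
public class CoreCount {
    public static void main(String[] args) {
        // Logical cores visible to this JVM. On the driver this reflects
        // the master node only -- it says nothing about the executors.
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println(cores);
    }
}
```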

  • I guess the answer would depend on the type of cluster manager you're using. – mazaneicha Jun 29 '22 at 18:48
  • @mazaneicha I'm using JavaSparkContext on an EMR cluster which I think is Yarn – capt-mac Jun 29 '22 at 18:51
  • @mazaneicha Is there any reason why this would not work: sparkContext.sc().statusTracker().getExecutorInfos().length - 1? When running in a deployed cluster it returns 0, even though I have 5 worker nodes. – capt-mac Jul 01 '22 at 15:45
  • @mazaneicha Maybe my understanding is wrong. I have a cluster and want to determine the available quantity of cores for my application to use in the map and reduce steps of my pipelines. At the moment I manually pass in an argument to set the number of cores for parallel processing. When I set it manually, I see in the console that executors are created/registered and tasks are run, but that happens a couple of seconds after the call to getExecutorInfos(). I guess I'm not sure when to check for executor info; I thought it was simply querying the cluster. – capt-mac Jul 05 '22 at 16:35
  • Okay @capt-mac, I might have misunderstood the question. – mazaneicha Jul 06 '22 at 13:06

1 Answer


I ended up using an AWS CLI shell script to query the cluster for this information.

First, I had to get the cluster ID:

aws emr list-clusters --active --query "Clusters[0].Id" --output text

Then I had to get the number of running instances:

aws emr describe-cluster --cluster-id "$CLUSTER_ID" --output json --query "Cluster.InstanceGroups[0].RunningInstanceCount"
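
Combined, the two queries above might look like the sketch below. It assumes the AWS CLI is configured on the box, that there is exactly one active EMR cluster, and that InstanceGroups[0] is the group of interest (clusters with separate master/core/task groups would need a different --query expression):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumption: exactly one active cluster; Clusters[0] picks it.
CLUSTER_ID=$(aws emr list-clusters --active --query "Clusters[0].Id" --output text)

# Assumption: InstanceGroups[0] is the group whose size we care about.
INSTANCES=$(aws emr describe-cluster --cluster-id "$CLUSTER_ID" \
  --output json --query "Cluster.InstanceGroups[0].RunningInstanceCount")

echo "$INSTANCES"
```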

I then called this from within the master node of the cluster and multiplied the result by

java.lang.Runtime.getRuntime().availableProcessors(); 

This got me the total number of partitions programmatically.
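
The final multiplication can be sketched in plain Java. The instance count is assumed to arrive from the aws emr describe-cluster query above (here passed in as a program argument); the helper name computeTotalPartitions is hypothetical:

```java
public class PartitionCalc {
    // Hypothetical helper: multiplies the EMR running-instance count
    // (obtained externally, e.g. via the AWS CLI) by the logical cores
    // visible to this JVM on the master node.
    static int computeTotalPartitions(int runningInstances) {
        return runningInstances * Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        int instances = Integer.parseInt(args[0]); // from the AWS CLI query
        System.out.println(computeTotalPartitions(instances));
    }
}
```

Note this assumes all instances have the same core count as the master node; a heterogeneous cluster would need per-instance-group core counts.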
