
I have a Spark application that reads data from Oracle into DataFrames, converts them to a JavaRDD, and saves the result to HDFS with saveAsTextFile. I am running this on YARN on an 8-node cluster. When I look at the job in the Spark web UI, I can see it is getting only 2 containers and 2 CPUs.

I am reading 5 tables from Oracle. Each table has around 500 million rows, and the total data size is about 80 GB.

spark-submit --class "oracle.table.join.JoinRdbmsTables" --master yarn --deploy-mode cluster oracleData.jar

I also tried:

spark-submit --class "oracle.table.join.JoinRdbmsTables" --master yarn --deploy-mode cluster --num-executors 40 oracleDataWrite.jar

With that, I could see 40 containers assigned to the job. However, I could only see 1 active task in the web UI.

I have another Spark application that loads a 20 GB text file, does some processing on the data, and saves the result to HDFS. That one gets assigned around 64 containers and CPUs.

spark-submit --class "practice.FilterSave" --master yarn --deploy-mode cluster batch-spark-0.0.1-SNAPSHOT-jar-with-dependencies.jar mergedData.json

The difference between them is that for the second application I am using JavaSparkContext, while for the first I am using SQLContext to work with DataFrames.

NOTE: I am not getting any errors in either application.

Here is the piece of code I am using to load the 5 tables (two shown):

Map<String, String> options = new HashMap<>();
options.put("driver", "oracle.jdbc.driver.OracleDriver");
options.put("url", "XXXXXXX");
options.put("dbtable", "QLRCR2.table1");
DataFrame df = sqlcontext.load("jdbc", options);
//df.show();
JavaRDD<Row> rdd = df.javaRDD();
rdd.saveAsTextFile("hdfs://path");

Map<String, String> options2 = new HashMap<>();
options2.put("driver", "oracle.jdbc.driver.OracleDriver");
options2.put("url", "XXXXXXX");
options2.put("dbtable", "QLRCR2.table2");
DataFrame df2 = sqlcontext.load("jdbc", options2);
//df2.show();
JavaRDD<Row> rdd2 = df2.javaRDD();
rdd2.saveAsTextFile("hdfs://path");
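
Since the same pattern repeats for all 5 tables, the loads could equally be written as a loop. A minimal sketch, assuming the remaining tables live in the same QLRCR2 schema; the table names and output paths are placeholders:

String[] tables = {"QLRCR2.table1", "QLRCR2.table2", "QLRCR2.table3",
                   "QLRCR2.table4", "QLRCR2.table5"};
for (String table : tables) {
    Map<String, String> opts = new HashMap<>();
    opts.put("driver", "oracle.jdbc.driver.OracleDriver");
    opts.put("url", "XXXXXXX");
    opts.put("dbtable", table);
    // Load one table and write it out as text, one directory per table
    DataFrame t = sqlcontext.load("jdbc", opts);
    t.javaRDD().saveAsTextFile("hdfs://path/" + table);
}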

Any help will be appreciated :)


1 Answer


The number of executors when running on YARN is set with --num-executors N. Note that this does NOT mean you will get N executors, only that N will be requested from YARN. The amount you actually get depends on the resources you request per executor. For example, if each node has 25 GB dedicated to YARN (yarn.nodemanager.resource.memory-mb in yarn-site.xml) and you have 8 nodes with no other application running on YARN, it makes sense to request 8 executors with ~20 GB each. Note that on top of what you request with --executor-memory, Spark adds a memory overhead of 10% (the default), so you can't ask for the whole 25 GB. The same applies to --executor-cores (yarn.nodemanager.resource.cpu-vcores in yarn-site.xml).
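
As a concrete illustration of that sizing on the question's 8-node cluster, the submit command might look like the following; the 20g memory and 4-core values are assumptions for a cluster with 25 GB and a handful of vcores per NodeManager, not values taken from the question:

spark-submit --class "oracle.table.join.JoinRdbmsTables" \
  --master yarn --deploy-mode cluster \
  --num-executors 8 \
  --executor-memory 20g \
  --executor-cores 4 \
  oracleData.jar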

The second question, regarding the number of tasks, is a separate issue; check out this good explanation of how stages are split into tasks.
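
One point that likely applies here: a JDBC-backed DataFrame loaded without partitioning options ends up as a single partition, which produces exactly one task for the read and for the subsequent save, no matter how many executors are available. A minimal sketch of a partitioned read; the numeric column ROW_ID and the bound values are hypothetical placeholders for whatever fits the real table:

Map<String, String> options = new HashMap<>();
options.put("driver", "oracle.jdbc.driver.OracleDriver");
options.put("url", "XXXXXXX");
options.put("dbtable", "QLRCR2.table1");
// Hypothetical numeric column used to split the table into parallel range queries
options.put("partitionColumn", "ROW_ID");
options.put("lowerBound", "1");
options.put("upperBound", "500000000");
options.put("numPartitions", "40"); // one read task per partition
DataFrame df = sqlcontext.load("jdbc", options);

All four options must be set together; Spark then issues numPartitions range queries in parallel, and the following saveAsTextFile stage runs one task per partition. Alternatively, df.repartition(N) after the load increases downstream parallelism, though the JDBC read itself would still use a single connection.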

  • Agreed with you, and thanks for answering. However, I am already aware of this. In my second application I don't specify num-executors in the command, and it gets executors assigned based on the size of the input. But the first application doesn't get enough on its own and only receives 2, so I gave it an arbitrary number of executors. Even then the job runs sequentially, and I don't know why :( – Mandeep Lohan Jul 27 '16 at 08:55