I have a Spark application that reads data from Oracle into DataFrames, converts them to JavaRDDs, and saves the result as text to HDFS. I am running it on YARN on an 8-node cluster. When I look at the job on the Spark web UI, I can see it is getting only 2 containers and 2 CPUs.
I am reading 5 tables from Oracle. Each table has around 500 million rows, and the total data size is about 80 GB.
spark-submit --class "oracle.table.join.JoinRdbmsTables" --master yarn --deploy-mode cluster oracleData.jar
I also tried:
spark-submit --class "oracle.table.join.JoinRdbmsTables" --master yarn --deploy-mode cluster --num-executors 40 oracleDataWrite.jar
I could see 40 containers assigned to the job, but only 1 active task on the web UI.
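In case it helps, I can also pin down cores and memory per executor explicitly (the values here are just illustrative, not tuned for my cluster):

spark-submit --class "oracle.table.join.JoinRdbmsTables" --master yarn --deploy-mode cluster --num-executors 40 --executor-cores 2 --executor-memory 4g oracleDataWrite.jar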
I have another Spark application that loads a 20 GB text file, does some processing on the data, and saves it to HDFS. That one gets assigned around 64 containers and CPUs.
spark-submit --class "practice.FilterSave" --master yarn --deploy-mode cluster batch-spark-0.0.1-SNAPSHOT-jar-with-dependencies.jar mergedData.json
The difference between them is that the second application uses JavaSparkContext directly, while the first uses SQLContext to work with DataFrames.
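To make the difference concrete, this is roughly how the two applications set up their contexts (a simplified sketch; class and file names are as above):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SQLContext;

// Second app: plain RDD pipeline via JavaSparkContext
SparkConf conf = new SparkConf().setAppName("FilterSave");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> lines = sc.textFile("mergedData.json"); // roughly one partition per HDFS block

// First app: DataFrame pipeline via SQLContext built on the same kind of context
SQLContext sqlcontext = new SQLContext(sc);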
NOTE: I AM NOT GETTING ANY ERROR IN EITHER APPLICATION.
Here is the code I am using to load the tables (two shown; the other three follow the same pattern):
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;

// Load table1 over JDBC and write it out to HDFS as text
Map<String, String> options = new HashMap<>();
options.put("driver", "oracle.jdbc.driver.OracleDriver");
options.put("url", "XXXXXXX");
options.put("dbtable", "QLRCR2.table1");
DataFrame df = sqlcontext.load("jdbc", options);
//df.show();
JavaRDD<Row> rdd = df.javaRDD();
rdd.saveAsTextFile("hdfs://path");

// Same pattern for table2
Map<String, String> options2 = new HashMap<>();
options2.put("driver", "oracle.jdbc.driver.OracleDriver");
options2.put("url", "XXXXXXX");
options2.put("dbtable", "QLRCR2.table2");
DataFrame df2 = sqlcontext.load("jdbc", options2);
//df2.show();
JavaRDD<Row> rdd2 = df2.javaRDD();
rdd2.saveAsTextFile("hdfs://path");
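One thing I came across in the Spark SQL docs is that the JDBC source reads into a single partition unless you supply a numeric partition column with bounds. Is something like this the right direction? (An untested sketch; NUM_ID is a hypothetical numeric column and the bounds are guesses.)

Map<String, String> parOptions = new HashMap<>();
parOptions.put("driver", "oracle.jdbc.driver.OracleDriver");
parOptions.put("url", "XXXXXXX");
parOptions.put("dbtable", "QLRCR2.table1");
// split the read across executors by ranges of a numeric column
parOptions.put("partitionColumn", "NUM_ID");  // hypothetical numeric column
parOptions.put("lowerBound", "0");
parOptions.put("upperBound", "500000000");    // roughly the row count, a guess
parOptions.put("numPartitions", "40");
DataFrame dfPar = sqlcontext.load("jdbc", parOptions);
dfPar.javaRDD().saveAsTextFile("hdfs://path");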
Any help will be appreciated :)