Using hadoop cluster with different machine configuration

Question

I have two linux machines, both with different configuration

Machine 1: 16 GB RAM, 4 Virtual Cores and 40 GB HDD (Master and Slave Machine)

Machine 2: 8 GB RAM, 2 Virtual Cores and 40 GB HDD (Slave machine)

I have set up a hadoop cluster between these two machines.
I am using Machine 1 as both master and slave.
And Machine 2 as slave.

I want to run my spark application and utilise as much as Virtual Cores and memory as possible but I am unable to figure out what settings.

My spark code looks something like:

conf = SparkConf().setAppName("Simple Application")
sc = SparkContext('spark://master:7077')
hc = HiveContext(sc)
sqlContext = SQLContext(sc)
spark = SparkSession.builder.appName("SimpleApplication").master("yarn-cluster").getOrCreate()

So far, I have tried the following:

When I process my 2 GB file only on Machine 1 (in local mode as Single node cluster), it uses all the 4 CPUs of the machine and completes in about 8 mins.
When I process my 2 GB file with cluster configuration as above, it takes slightly longer than 8 mins, though I expected, it would take less time.

What number of executors, cores, memory do I need to set to maximize the usage of cluster?
I have referred below articles but because I have different machine configuration in my case, not sure what parameter would fit best.

Apache Spark: The number of cores vs. the number of executors

Any help will be greatly appreciated.

Well, obviously you can't use more resources than the smallest node in the cluster... Also, you do not need a HIveContext and SQLContext... Both are deprecated in favor of `SparkSession.sql`, and you set your app name twice? Pass the conf into the session builder, or only use the session builder — OneCricketeer, Jan 31 '18 at 13:28
Hi, i tried using `--num-executors 1 --executor-cores 2`, post which i could see all the virtual cores are being used during processing. Probably that seems to be the best configuration. And thanks for correction about the appName being set twice. — Arun, Jan 31 '18 at 14:08
Do i not need hiveContext to query hive table? Or do you mean i can use `SparkSession. sql` to to query hive table? — Arun, Jan 31 '18 at 14:10
It's not needed. You will need to enable Hive support, but yes https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables — OneCricketeer, Jan 31 '18 at 14:18
You could increase the executor memory, as well, if necessary — OneCricketeer, Jan 31 '18 at 14:20
Thanks. I will try using spark.sql... Thanks for your prompt response. — Arun, Jan 31 '18 at 14:23

score 0 · Answer 1 · answered Jan 31 '18 at 14:45

When I process my 2 GB file with cluster configuration as above, it takes slightly longer than 8 mins, though I expected, it would take less time.

Its not clear where your file is stored.

I see you're using Spark Standalone mode, so I'll assume it's not split on HDFS into about 16 blocks (given block size of 128MB).

In that scenario, your entire file will processed at least once in whole, plus the overhead of shuffling that data amongst the network.

If you used YARN as the Spark master with HDFS as the FileSystem, and a splittable file format, then the computation would go "to the data", which you could expect quicker run times.

As far as optimal settings, there's tradeoffs between cores&memory and amount of executors, but there's no magic number for a particular workload and you'll always be limited by the smallest node in the cluster, keeping in mind the memory of the Spark driver and other processes on the OS should be accounted for when calculating sizes

Using hadoop cluster with different machine configuration

1 Answers1