I have a problem when running a Spark script on a multicore machine. I am on an Ubuntu 16.04 machine. I've installed Spark as follows:
(1) I downloaded spark-2.4.0-bin-hadoop2.7.tgz and unpacked it into /usr/local/spark. I added Spark's bin folder to PATH by adding "export PATH=/usr/local/spark/bin:$PATH" to my .bashrc file. I installed Java with apt-get: apt-get install oracle-java8-installer and apt-get install oracle-java8-set-default
(2) I am using the following script:
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    # Run locally, using as many worker threads as there are cores
    conf = SparkConf().setAppName("word count").setMaster("local[*]")
    sc = SparkContext(conf=conf)

    lines = sc.textFile("/path/to/file/word_count.txt")
    words = lines.flatMap(lambda line: line.split(" "))
    wordCounts = words.countByValue()

    # Try to read back how many executors/cores the job actually got
    executor_count = len(sc._jsc.sc().statusTracker().getExecutorInfos()) - 1
    cores_per_executor = int(sc.getConf().get('spark.executor.cores', '1'))

    # for word, count in wordCounts.items():
    #     print("{} : {}".format(word, count))
    print("Number of executors :" + str(executor_count))
    print("Cores per executor :" + str(cores_per_executor))
(3) When I run it with:
$ spark-submit WordCount.py
I get:
18/11/17 10:42:12 WARN Utils: Your hostname, se8-HVM-domU resolves to a loopback address: 127.0.1.1; using 192.168.91.61 instead (on interface eth0)
18/11/17 10:42:12 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
18/11/17 10:42:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Number of executors :0
Cores per executor :1
(4) The machine has 32 cores:
# cat /proc/cpuinfo | grep processor | wc -l
32
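For comparison, the same count from Python's standard library (just a sanity check; os.cpu_count() should match the /proc/cpuinfo count above):

    import os
    print(os.cpu_count())  # should also report 32 on this machine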
If I understand the output in (3) correctly, the script is using just 1 core, even though I configured the SparkContext with "local[*]" (which should use as many cores as are available).
Now the simple question: how do I make the script use more than 1 core?