My apologies if this question has already been answered; I had a look at the archive but did not find an answer specific to my case.
I am new to Spark. I am trying to run the simple example attached below in parallel locally, using spark-2.1.1 on my macOS Sierra machine. Since I have 4 cores and there are 4 tasks each taking 10 seconds, I was expecting the whole run to take a bit more than 10 seconds in total.
I see that each task takes the expected amount of time, but there seem to be only 2 threads of execution; I was expecting 4. As you can see in the code, the value of each tuple is the execution time of the corresponding task.
insight086:pyspark lquesada$ more output/part-00000
(u'1', 10.000892877578735)
(u'3', 10.000878095626831)
insight086:pyspark lquesada$ more output/part-00001
(u'2', 10.000869989395142)
(u'4', 10.000877857208252)
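In case it is relevant, my understanding is that the number of concurrent tasks in local mode is bounded both by the master URL (e.g. local[4]) and by the number of partitions of the RDD. This is a minimal sketch of what I was planning to try in order to force 4 partitions; treat it as my assumption about what controls the parallelism rather than something I have confirmed:

from pyspark import SparkContext

sc = SparkContext(master='local[4]', appName='SparkWordCount')
print sc.defaultParallelism  # number of worker threads in local mode
# Ask for at least 4 input partitions so each word can become its own task.
input_file = sc.textFile('/Users/lquesada/Dropbox/hadoop/pyspark/input.txt',
                         minPartitions=4)
print input_file.getNumPartitions()

Is forcing the partition count like this the right approach, or is the 2-way split visible in my output files expected for an input this small?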
Also, the total time this takes is considerably more than 20 seconds:
total_time 33.2253439426
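I also wondered how much of that is SparkContext startup rather than the job itself, so I was going to time the two phases separately along these lines (just a sketch, I have not measured these numbers yet):

import time
from pyspark import SparkContext

t0 = time.time()
sc = SparkContext(appName='SparkWordCount')
t1 = time.time()
print 'context_startup', t1 - t0  # JVM + context start-up overhead
# ... run the same flatMap/map/reduceByKey job here ...
sc.stop()
print 'job_time', time.time() - t1

Does that split make sense, or is the overhead expected to show up elsewhere?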
Thanks in advance for your help!
Cheers, Luis
INPUT FILE:
1
2
3
4
SCRIPT:
from pyspark import SparkContext
import time

def mymap(word):
    # Each task sleeps for 10 seconds and returns how long it actually took.
    start = time.time()
    time.sleep(10)
    et = time.time() - start
    return (word, et)

def main():
    start = time.time()
    sc = SparkContext(appName='SparkWordCount')
    input_file = sc.textFile('/Users/lquesada/Dropbox/hadoop/pyspark/input.txt')
    counts = input_file.flatMap(lambda line: line.split()) \
                       .map(mymap) \
                       .reduceByKey(lambda a, b: a + b)
    counts.saveAsTextFile('/Users/lquesada/Dropbox/hadoop/pyspark/output')
    sc.stop()
    print 'total_time', time.time() - start

if __name__ == '__main__':
    main()
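In case the invocation matters: my understanding is that the local thread count can also be set at submit time, e.g. (assuming the script above is saved as script.py, which is my naming, not a file mentioned above):

spark-submit --master 'local[4]' script.py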