I am trying to measure Spark's performance depending on the number of executors and cores. The idea is to play with:
spark.conf.set("spark.executor.instances", "x")
spark.conf.set('spark.cores.max', 'x')
to test how much Spark's performance improves when I change the number of executors and cores. The data is 1.66 GB of Twitter .json files. I am working with an HP laptop:
Processor: Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz (2.11GHz) // 16 GB RAM
I time each run like this:
import time
st = time.time()
print("start time: ", st)
#### Code ####
elapsed_time = time.time() - st
print("...Elapsed time SPARK: %.2fs" % elapsed_time)
I found that performance barely changes whether I use 1, 3, or 5 executors. For example:
import time
from pyspark.sql import SparkSession

st = time.time()
print("start time: ", st)

spark = SparkSession.builder.appName('Basics').getOrCreate()
spark.conf.set("spark.executor.instances", "1")
spark.conf.set("spark.cores.max", "1")

# mount points at the directory containing the Twitter .json.bz2 files
df = spark.read.json(mount + '/*/*.json.bz2')

elapsed_time = time.time() - st
print("...Elapsed time SPARK: %.2fs" % elapsed_time)
The results of three runs:
1: 1 executor, 1 core   - start time: 1549530285.584573  ...Elapsed time SPARK: 315.52s
2: 3 executors, 3 cores - start time: 1549528358.4399529 ...Elapsed time SPARK: 308.30s
3: 5 executors, 5 cores - start time: 1549528690.1516254 ...Elapsed time SPARK: 289.28s
Is that improvement normal? I was expecting something much more significant.