i have a cluster on AWS with 2 slaves and 1 master. All instances are of type m1.large. I'm running spark version 1.4. I'm benchmarking the performance of spark over 4m data coming from red shift. I fired one query through pyspark shell
df = sqlContext.load(source="jdbc", url="connection_string", dbtable="table_name", user='user', password="pass")
df.registerTempTable('test')
d=sqlContext.sql("""
select user_id from (
select -- (i1)
sum(total),
user_id
from
(select --(i2)
avg(total) as total,
user_id
from
test
group by
order_id,
user_id) as a
group by
user_id
having sum(total) > 0
) as b
"""
)
When i do d.count(), the above query takes 30 sec when df
is not cached and 17sec when df
is cached in memory.
I'm expecting these timings to be closer to 1-2s.
These are my spark configurations:
spark.executor.memory 6154m
spark.driver.memory 3g
spark.shuffle.spill false
spark.default.parallelism 8
rest is set to its default values. Can any one see what i'm missing here ?