
I have a cluster on AWS with 2 slaves and 1 master. All instances are of type m1.large. I'm running Spark 1.4. I'm benchmarking the performance of Spark over 4M rows of data coming from Redshift. I fired one query through the pyspark shell:

    df = sqlContext.load(source="jdbc", url="connection_string", dbtable="table_name", user='user', password="pass")
    df.registerTempTable('test')
    d = sqlContext.sql("""
    select user_id from (
        select -- (i1)
            sum(total),
            user_id
        from
            (select -- (i2)
                avg(total) as total,
                user_id
            from
                test
            group by
                order_id,
                user_id) as a
        group by
            user_id
        having sum(total) > 0
    ) as b
    """)
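
(Side note: Spark 1.4 also exposes the same JDBC load through the DataFrameReader API; a minimal equivalent sketch, using the same placeholder connection values as above:)

    # Equivalent load via sqlContext.read (DataFrameReader, added in Spark 1.4).
    # "connection_string", "table_name", "user" and "pass" are the same
    # placeholders used in the question.
    df = (sqlContext.read
          .format("jdbc")
          .options(url="connection_string",
                   dbtable="table_name",
                   user="user",
                   password="pass")
          .load())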

When I run d.count(), the above query takes 30 sec when df is not cached and 17 sec when df is cached in memory.

I'm expecting these timings to be closer to 1-2s.

These are my Spark configurations:

    spark.executor.memory     6154m
    spark.driver.memory       3g
    spark.shuffle.spill       false
    spark.default.parallelism 8
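
For reference, a minimal sketch of setting the same values programmatically with SparkConf (illustrative only: spark.driver.memory normally has to be set before the driver JVM starts, e.g. in conf/spark-defaults.conf or on the spark-submit command line):

    # Illustrative sketch: building a SparkContext with the question's settings.
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    conf = (SparkConf()
            .set("spark.executor.memory", "6154m")
            .set("spark.driver.memory", "3g")  # only effective if set before the driver JVM starts
            .set("spark.shuffle.spill", "false")
            .set("spark.default.parallelism", "8"))
    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)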

The rest is set to its default values. Can anyone see what I'm missing here?

Arpit
  • What is the performance for ONLY the inner query? – Geek Jul 29 '15 at 07:23
  • @Geek i1 took 13 sec (no cache). – Arpit Jul 29 '15 at 08:18
  • `.cache()` only tells Spark to cache the data once it has been demanded by a Spark action. The first time is always slow; subsequent accesses should be faster. Also, I noticed Spark on EC2 wasn't activating all the cores on a `c3.8xlarge` because `--executor-cores` wasn't set. – Paul Jul 31 '15 at 10:30
  • I don't have any problem with caching; the problem is that the query is not fast enough even after caching. It takes 17 sec to complete after caching the dataframe. – Arpit Jul 31 '15 at 12:04
  • If I were you I would check the physical plan for such a query (use `.explain()` to print it) and look at the Spark Web UI to see which parts take the most time. Does the table you use have any other columns? Caching only the ones needed for the query might help (see the sketch just after these comments). – krcz Aug 05 '15 at 01:13
  • @krcz the most time was taken by shuffle stages and gc. – Arpit Aug 05 '15 at 08:11
  • My suggestion would be to reconsider whether Spark should be used to load the data. It depends on the specific case and on what you will do with the data afterwards, but native drivers are generally faster at retrieving data. As @krcz suggested, I would take a look at the plan, as you may be suffering the effect of having two planners (the one in Spark and the one in Redshift). Additionally, I am not familiar with the capabilities of the spark-redshift integration, which may make some operations slower if things like filter push-down are not yet available. – Daniel H. Aug 06 '15 at 11:26
  • @DanielH. On Redshift the query took only 1.2 sec to return the result to Spark. I also tried with Postgres but the result was the same. What do you mean by native drivers? – Arpit Aug 06 '15 at 11:54
  • I was referring to the way you retrieve data from the datastore, in this case Redshift. From my experience with other datastores, and obviously depending on what you are planning to do with the data afterwards, you could retrieve the data from Redshift with the existing Java drivers, for example, and then transform that into a SchemaRDD that you can process in Spark. Bear in mind that this highly depends on the maturity of the spark-redshift connector and may change over time. Hope it helps. – Daniel H. Aug 06 '15 at 12:11
  • Let's forget Redshift for a second; I'm getting the same results with Postgres as well. – Arpit Aug 06 '15 at 14:12
  • Also, the docs have some interesting tuning options: http://spark.apache.org/docs/latest/sql-programming-guide.html#performance-tuning – Ryan Apr 10 '16 at 16:41
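
A minimal sketch of the diagnostics suggested in the comments above (the plan check and the column-pruned cache), assuming the table only needs the three columns the query references (total, order_id, user_id):

    # Print the physical plan to see where the shuffles happen.
    d.explain()

    # Cache only the columns the query needs instead of the whole table,
    # re-registering the narrower DataFrame under the same name.
    slim = df.select("total", "order_id", "user_id")
    slim.registerTempTable('test')
    sqlContext.cacheTable("test")
    slim.count()  # an action is needed to actually materialize the cache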

2 Answers


This is normal; don't expect Spark to answer in a few milliseconds the way MySQL or Postgres do. Spark has low latency compared to other big data solutions like Hive or Impala, but you cannot compare it with a classic database: Spark is not a database where data is indexed!

Watch this video: https://www.youtube.com/watch?v=8E0cVWKiuhk

They clearly put Spark in the "not so low latency" range (the talk's latency chart is not reproduced here).

Did you try Apache Drill? I found it a bit faster (I use it for small HDFS JSON files, 2-3 GB, and it is much faster than Spark for SQL queries).

Thomas Decaux
  1. Set spark.default.parallelism to 2
  2. Start Spark with --executor-cores 8
  3. Modify this part

    df.registerTempTable('test')
    d = sqlContext.sql("""...

to

    df.registerTempTable('test')
    sqlContext.cacheTable("test")
    d = sqlContext.sql("""...
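
Putting the steps together, a minimal sketch of the suggested flow (the connection values are the same placeholders as in the question, and the query is abbreviated):

    # Sketch of the full flow with the answer's changes applied.
    # Launch-time settings (parallelism, executor cores) are assumed to be
    # passed when starting the shell, e.g.: pyspark --executor-cores 8
    df = sqlContext.load(source="jdbc", url="connection_string",
                         dbtable="table_name", user="user", password="pass")
    df.registerTempTable('test')
    sqlContext.cacheTable("test")                    # cache before querying
    d = sqlContext.sql("select user_id from test")   # abbreviated query
    d.count()  # the first action populates the cache; later runs read memory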

Boggio