I want to use Spark DataFrames effectively by reducing database round trips and memory usage.
Here is some sample code (not the full implementation):
Map<String, String> options = new HashMap<>();
// The JDBC "dbtable" option expects a table name or a parenthesized subquery with an alias
options.put("dbtable", "(select * from TestTable) AS t");

// Create the base DataFrame from the JDBC source
DataFrame df1 = sqlContext.read().format("jdbc").options(options).load();
df1.registerTempTable("TestDBFrame");

// Query 1
DataFrame df2 = sqlContext.sql("SELECT name FROM TestDBFrame WHERE age >= 10");

// Query 2
DataFrame df3 = sqlContext.sql("SELECT name FROM TestDBFrame WHERE age >= 50");

// df2 action
df2.count();

// df3 action
df3.count();
When Query 1 and Query 2 are defined, how many times does Spark hit the DB? Does it hit the DB twice?
When I call count() on df2 and df3, which are both derived from the originally created DataFrame, does Spark query the DB another two times, or does it simply read the data from memory?
Since I need to minimize DB round trips and memory usage, I would appreciate a clear explanation of this behavior.
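If each count() does go back to the database, would explicitly caching the base DataFrame avoid that? Below is a minimal sketch of what I have in mind, reusing the same options and sqlContext as above and assuming cache()/persist() is the right mechanism here (the JDBC url/driver options are still omitted):

// Sketch: cache the base DataFrame once so later queries reuse the in-memory data
// (assumption: cache() is the appropriate way to avoid repeated JDBC reads)
DataFrame df1 = sqlContext.read().format("jdbc").options(options).load();
df1.cache();                      // lazy: the cache is only materialized by the first action
df1.registerTempTable("TestDBFrame");

DataFrame df2 = sqlContext.sql("SELECT name FROM TestDBFrame WHERE age >= 10");
DataFrame df3 = sqlContext.sql("SELECT name FROM TestDBFrame WHERE age >= 50");

df2.count();                      // first action: reads from the DB and populates the cache
df3.count();                      // later action: should read from the cached data, not the DB

Is this the correct way to keep the number of DB reads to one, or does Spark already avoid the second read without the explicit cache()?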