I want to use Spark DataFrames effectively by reducing database round trips and memory usage.
Here is some sample code (not the full implementation):
Map<String, String> options = new HashMap<>();
// The JDBC "dbtable" option expects a table name or a parenthesized subquery with an alias
options.put("dbtable", "(select * from TestTable) AS t");

// Create the base DataFrame from the JDBC source
DataFrame df1 = sqlContext.read().format("jdbc").options(options).load();
df1.registerTempTable("TestDBFrame");

// Query 1
DataFrame df2 = sqlContext.sql("SELECT name FROM TestDBFrame WHERE age >= 10");

// Query 2
DataFrame df3 = sqlContext.sql("SELECT name FROM TestDBFrame WHERE age >= 50");

// df2 action
df2.count();

// df3 action
df3.count();
When Query 1 and Query 2 are defined, how many times does Spark hit the DB? Does it hit the DB twice?
When I call count() on df2 and df3, which are both derived from the originally created DataFrame, does Spark query the DB another two times, or does it simply read the data from memory?
Since I need to minimize DB round trips and memory usage, I would appreciate a clear explanation of this behavior.
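If each count() does go back to the database, would explicitly caching the base DataFrame avoid that? Below is a minimal sketch of what I have in mind, reusing the same options and sqlContext as above and assuming cache()/persist() is the right mechanism here (the JDBC url/driver options are still omitted):

// Sketch: cache the base DataFrame once so later queries reuse the in-memory data
// (assumption: cache() is the appropriate way to avoid repeated JDBC reads)
DataFrame df1 = sqlContext.read().format("jdbc").options(options).load();
df1.cache();                      // lazy: the cache is only materialized by the first action
df1.registerTempTable("TestDBFrame");

DataFrame df2 = sqlContext.sql("SELECT name FROM TestDBFrame WHERE age >= 10");
DataFrame df3 = sqlContext.sql("SELECT name FROM TestDBFrame WHERE age >= 50");

df2.count();                      // first action: reads from the DB and populates the cache
df3.count();                      // later action: should read from the cached data, not the DB

Is this the correct way to keep the number of DB reads to one, or does Spark already avoid the second read without the explicit cache()?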