Persist option in Apache Spark

Question

Hi, I am new to Apache Spark, and I am querying Hive tables using Apache Spark SQL in Java.

This is my code:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.hive.HiveContext;

    SparkConf sparkConf = new SparkConf().setAppName("Hive").setMaster("local");
    JavaSparkContext ctx = new JavaSparkContext(sparkConf);
    HiveContext sqlContext = new org.apache.spark.sql.hive.HiveContext(ctx.sc());
    // First query
    org.apache.spark.sql.Row[] results = sqlContext.sql("Select * from Tablename where Column='Value'").collect();
    // Second query: same table, different value for the same column
    org.apache.spark.sql.Row[] results2 = sqlContext.sql("Select * from Tablename where Column='Value1'").collect();

I also tried running two different queries in the same application, and I noticed that it connects to the Hive metastore each time. How can I solve this? Also, how can I use the persist option efficiently?

If the queries were unrelated, then it would make sense that the Hive metastore is queried twice. It might help if you posted your program containing the queries. – Till Rohrmann, Jul 27 '15 at 06:49
Thanks for your reply. The other query just queries the same table with a different value for the same column. – wazza, Jul 27 '15 at 06:59

score 1 · Accepted Answer · answered Jul 27 '15 at 07:07

It might help to call sqlContext.cacheTable("Tablename") before executing the two queries.

According to the docs, it does what you're looking for:

Caches the specified table in-memory.

Till Rohrmann

Thanks a lot. I have another question about this. I am using Spark in Java, so when I run this for the first time it caches the table and will probably take a few minutes to return the result. When I run it a second time, will it be faster than before? – wazza Jul 27 '15 at 07:13
The caching and thus the network I/O will be done every time you run your program. The advantage of caching is that you can reuse intermediate results within the same program. But what might speed up your program a little bit is the JVM warmup of the Spark cluster. – Till Rohrmann Jul 27 '15 at 07:17
I have another question. I am using Spark SQL in web services, where I pass different column values to the queries above. How can I use the same Hive context for every request? – wazza Jul 27 '15 at 07:20
I'm not a web service expert but if you're using JAX-WS, then this might help you: http://stackoverflow.com/a/11096654/4815083 – Till Rohrmann Jul 27 '15 at 07:26
...I have an issue running Spark on a multi-node Hadoop cluster. Can you help me with this? – wazza Aug 04 '15 at 08:54
What's your problem? Maybe I can help you. – Till Rohrmann Aug 04 '15 at 11:41
Thank you very much. Please see my post in this link http://stackoverflow.com/questions/31804723/apache-spark-sql-issue-in-multi-node-hadoop-cluster – wazza Aug 04 '15 at 12:52
Caching is not working properly for me. After caching with the above command, querying the data still returns results at the usual speed. – wazza Aug 11 '15 at 09:47
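
On the earlier question about reusing the same Hive context for every web-service request: one common approach, independent of the web framework, is to create the context once and hand the same instance to every request handler. Below is a minimal sketch of that idea in plain Java. `SharedContext`, `SharedContextDemo`, and `handleRequest` are hypothetical names, and a plain `Object` stands in for the real `HiveContext` so that the sketch runs without Spark on the classpath.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/**
 * Holds one lazily created, shared instance of an expensive object.
 * In the web-service scenario, T would be the HiveContext (assumption).
 */
final class SharedContext<T> {
    private final Supplier<T> factory;
    private volatile T instance;

    SharedContext(Supplier<T> factory) {
        this.factory = factory;
    }

    /** Returns the shared instance, creating it on the first call only. */
    T get() {
        T local = instance;
        if (local == null) {
            synchronized (this) {
                local = instance;
                if (local == null) {
                    local = factory.get();
                    instance = local;
                }
            }
        }
        return local;
    }
}

final class SharedContextDemo {
    /** Counts how often the stand-in context is constructed. */
    static final AtomicInteger CREATED = new AtomicInteger();

    // Stand-in for building the SparkConf, JavaSparkContext and HiveContext.
    static final SharedContext<Object> CONTEXT =
            new SharedContext<>(() -> {
                CREATED.incrementAndGet();
                return new Object();
            });

    /** Simulates handling one web-service request. */
    static Object handleRequest(String columnValue) {
        // A real handler would run sqlContext.sql(...) with columnValue here.
        return CONTEXT.get();
    }

    public static void main(String[] args) {
        Object first = handleRequest("Value");
        Object second = handleRequest("Value1");
        System.out.println("same context: " + (first == second));
        System.out.println("times created: " + CREATED.get());
    }
}
```

In a real service, the supplier would build the `SparkConf`, `JavaSparkContext`, and `HiveContext` as in the question, and each request would call `CONTEXT.get()` instead of constructing a new context. Whether concurrent requests can safely share one context depends on the Spark version, so treat this as a starting point rather than a finished design.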
