I created a Hive table from a 10 GB CSV file using Hue, then ran a SQL query against it from Spark. The query takes a very long time, more than 2 hours. Can anybody tell me whether this is a Spark problem, or whether I did something wrong?
I tried all the obvious combinations of executor count, executor cores, and executor memory, for example:
```
--driver-memory 10g \
--num-executors 10 \
--executor-memory 10g \
--executor-cores 10 \
```
I tested `num-executors` at 10, 15, 20, 50, and 100, and varied memory and cores over similar ranges.
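For completeness, one combination I have not tried yet, based on the commonly cited guideline of roughly 5 cores per executor (these numbers are an assumption, not something I have measured):

```
--driver-memory 10g \
--num-executors 60 \
--executor-memory 10g \
--executor-cores 5 \
```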
For reference, the cluster has 6 nodes, 380+ cores, and 1 TB of memory.
My SQL query:

```sql
select
  percentile_approx(x1, array(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)) as x1_quantiles,
  percentile_approx(x2, array(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)) as x2_quantiles,
  percentile_approx(x3, array(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)) as x3_quantiles
from mytest.test1
```
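In case it is relevant: Hive's `percentile_approx` also accepts an optional third argument controlling the approximation accuracy (the number of histogram bins, default 10000). I have not experimented with lowering it, but a variant would look like this (the value 1000 here is an assumption, chosen only to illustrate the trade-off):

```sql
-- Lower accuracy value = fewer histogram bins = less memory and work,
-- at the cost of less precise quantiles.
select
  percentile_approx(x1, array(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9), 1000) as x1_quantiles
from mytest.test1
```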
The code is pretty straightforward:
```scala
val query = args(0)

val sparkConf = new SparkConf().setAppName("Spark Hive")
val sc = new SparkContext(sparkConf)
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

// Note: cacheTable is lazy -- the table is only materialized in memory
// the first time a query actually scans it.
sqlContext.cacheTable("mytest.test1")

val start = System.currentTimeMillis()
val testload = sqlContext.sql(query)
testload.show()
val end = System.currentTimeMillis()
println("Time took " + (end - start) + " ms")
```
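For what it's worth, a variant I considered (a sketch, not something I have benchmarked at scale): force the cached table to materialize with a cheap action before timing, so the measured time covers the query itself rather than the initial CSV scan that the lazy cache triggers.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object QuantileTiming {
  def main(args: Array[String]): Unit = {
    val query = args(0)

    val sparkConf = new SparkConf().setAppName("Spark Hive")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

    sqlContext.cacheTable("mytest.test1")

    // Materialize the cache with a cheap action first; otherwise the
    // first timed query pays the full cost of scanning the 10 GB CSV.
    sqlContext.sql("select count(*) from mytest.test1").collect()

    val start = System.currentTimeMillis()
    sqlContext.sql(query).show()
    println("Time took " + (System.currentTimeMillis() - start) + " ms")
  }
}
```

Separately, since the source is a plain CSV, converting the table to a columnar format such as Parquet might also be worth testing, as each `percentile_approx` only needs one column.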