Greetings,
I have created a Spark 2.1.1 cluster on Amazon EC2 with m4.large instances, 1 master and 5 slaves to start. My PostgreSQL 9.5 database (t2.large) has a table of over 2 billion rows and 7 columns that I would like to process. I have followed the directions from the Apache Spark website and various other sources on how to connect to and process this data.
My problem is that Spark SQL performance is much slower than my database. My SQL statement (see the code below) takes about 21 minutes in psql, but Spark SQL takes about 42 minutes to finish. My main goal is to measure the performance of PSQL vs Spark SQL, and so far I am not getting the desired results. I would appreciate the help.
Thank you
I have tried increasing fetchsize from 10000 to 100000, caching the DataFrame, increasing numPartitions to 100, setting spark.sql.shuffle.partitions to 2000, doubling my cluster size, and using a larger instance type, and so far I have not seen any improvement.
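For reference, the partitioned JDBC read I experimented with looked roughly like the sketch below. Spark 2.1 needs a numeric partitionColumn, so "record_id" and the bounds here are illustrative placeholders rather than my real schema.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Spark SQL")
  .getOrCreate()

// Sketch of the tuning attempts described above. "record_id" is a placeholder
// numeric column; the lower/upper bounds are illustrative, not my actual values.
val partitionedDF = spark.read.format("jdbc")
  .option("url", DBI_URL)
  .option("driver", "org.postgresql.Driver")
  .option("dbtable", "ghcn_all")
  .option("partitionColumn", "record_id")  // placeholder numeric column
  .option("lowerBound", "1")
  .option("upperBound", "2000000000")      // roughly the row count
  .option("numPartitions", "100")          // one of the values I tried
  .option("fetchsize", "100000")           // raised from the original 10000
  .load()
  .cache()                                 // I also tried caching the DataFrame

// I also tried raising the shuffle partition count.
spark.conf.set("spark.sql.shuffle.partitions", "2000")

partitionedDF.createOrReplaceTempView("ghcn_all")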
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Spark SQL")
  .getOrCreate()

// Read the whole ghcn_all table over JDBC and register it as a temp view.
val jdbcDF = spark.read.format("jdbc")
  .option("url", DBI_URL)
  .option("driver", "org.postgresql.Driver")
  .option("dbtable", "ghcn_all")
  .option("fetchsize", 10000)
  .load()

jdbcDF.createOrReplaceTempView("ghcn_all")
val sqlStatement = "SELECT ghcn_date, element_value/10.0
FROM ghcn_all
WHERE station_id = 'USW00094846'
AND (ghcn_date >= '2015-01-01' AND ghcn_date <= '2015-12-31')
AND qflag IS NULL
AND element_type = 'PRCP'
ORDER BY ghcn_date";
val sqlDF = spark.sql(sqlStatement)

// Time only the count() action; the query above is evaluated lazily until here.
val start: Long = System.nanoTime
val num_rows: Long = sqlDF.count()
val end: Long = System.nanoTime

println("Total Row : " + num_rows)
println("Total Collect Time Lapse : " + ((end - start) / 1000000) + " ms")