I want to run SQL on my Parquet data in Spark using the following code:
val parquetDF = spark.read.parquet(path)
parquetDF.createOrReplaceTempView("table_name")
val df = spark.sql("select column_1, column_4, column_10 from table_name")
println(df.count())
My question is: does this code read only the required columns from disk?
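One rough way I thought of to check this myself (just a sketch, reusing the "table_name" view registered above; prunedDF is my own name) is to print the physical plan and look at which columns the Parquet FileScan lists:

val prunedDF = spark.sql("select column_1, column_4, column_10 from table_name")
// If column pruning happens, the FileScan node in the physical plan should
// list only column_1, column_4 and column_10 in its ReadSchema.
prunedDF.explain(true)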
Theoretically the answer should be yes, but I need an expert opinion, because in the case of JDBC queries (MySQL) the read phase (spark.read) takes more time compared to the actions (maybe related to the connection, but I am not sure). The JDBC code follows:
spark.read.jdbc(jdbcUrl, query, props).createOrReplaceTempView("table_name")
val df = spark.sql("select column_1, column_4, column_10 from table_name")
df.show()
println(df.count())
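For the JDBC case, one variant I am considering (again only a sketch; pushedQuery and pushedDF are my own names, and it assumes the same jdbcUrl and props as above) is to push the projection into the query itself, so that MySQL only returns the three columns over the connection:

// Pass a subquery as the "table" so the projection is done on the MySQL side.
val pushedQuery = "(select column_1, column_4, column_10 from table_name) as t"
val pushedDF = spark.read.jdbc(jdbcUrl, pushedQuery, props)
pushedDF.show()
println(pushedDF.count())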
If someone can explain the framework flow in both cases, it would be very helpful.
Spark version 2.3.0
Scala version 2.11.11