I'm not very familiar with Spark, so please forgive me if this is naive.
I have an HDFS data lake to work with, and the data can be queried through Hive, Presto, Impala, and Spark (in the cluster).
However, Spark does not have built-in access control, so for security reasons I am only allowed to query through Hive/Presto.
My questions:
Can I install Spark locally (e.g. on my laptop) and use JDBC to connect to the data source (Hive or Presto), as described in https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html ? That way I could query the data with PySpark's DataFrame syntax in Python instead of SQL, which is more productive for me.
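For reference, this is a sketch of what I have in mind (the host, port, driver class, jar path, and table name are placeholders for my environment, and the Presto JDBC driver jar would need to be on the classpath):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Local Spark session; spark.jars points at the Presto JDBC driver jar.
# Host, port, catalog, and paths below are placeholders.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("jdbc-to-presto")
    .config("spark.jars", "/path/to/presto-jdbc.jar")
    .getOrCreate()
)

# Register the remote table as a DataFrame via the JDBC data source.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:presto://presto-coordinator:8080/hive/default")
    .option("driver", "com.facebook.presto.jdbc.PrestoDriver")
    .option("dbtable", "events")  # placeholder table
    .load()
)

# This is the crux of my question: does a filter/limit like this get
# pushed down to Presto, or does Spark pull the whole table first?
result = df.filter(F.col("event_date") == "2020-01-01").limit(1000)
result.show()
```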
How is this different from reading the data with Pandas? With Pandas, the data is loaded directly onto my laptop, so I can only load ~1M rows of data; otherwise the loading takes too long. Will a locally installed Spark push queries, limits, and transformations down to the data source? If not, there is no point in this approach.
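For comparison, this is roughly what I do with Pandas today (using PyHive as the DBAPI connector; host, port, and table are placeholders): the whole result set has to come into local memory, which is why I cap it with a LIMIT.

```python
import pandas as pd
from pyhive import presto  # assuming PyHive as the Presto DBAPI driver

# Placeholder coordinator host/port for the cluster.
conn = presto.connect(host="presto-coordinator", port=8080)

# Everything the query returns is materialized on my laptop,
# so I keep it to ~1M rows to stay within memory and time budgets.
df = pd.read_sql(
    "SELECT * FROM hive.default.events LIMIT 1000000",
    conn,
)
```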
What is the speed difference between running a query in Presto (on the cluster) and running it through a local Spark connected to Presto via JDBC?
Thanks!