I'm not very familiar with Spark, so please forgive me if this is naive.
I have an HDFS data lake to work with, and the data can be queried through Hive, Presto, Impala, and Spark (in the cluster).
However, Spark does not have built-in access control, so for security reasons I am only allowed to query through Hive/Presto.
My questions:
Can I install Spark locally (e.g. on my laptop) and use JDBC to connect to the data source (Hive or Presto), as described in https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html ? That way I could query the data with PySpark's DataFrame syntax in Python instead of SQL, which is more productive for me.
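For reference, this is a sketch of what I have in mind (the host, port, driver class, jar path, and table name are placeholders for my environment, and the Presto JDBC driver jar would need to be on the classpath):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Local Spark session; spark.jars points at the Presto JDBC driver jar.
# Host, port, catalog, and paths below are placeholders.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("jdbc-to-presto")
    .config("spark.jars", "/path/to/presto-jdbc.jar")
    .getOrCreate()
)

# Register the remote table as a DataFrame via the JDBC data source.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:presto://presto-coordinator:8080/hive/default")
    .option("driver", "com.facebook.presto.jdbc.PrestoDriver")
    .option("dbtable", "events")  # placeholder table
    .load()
)

# This is the crux of my question: does a filter/limit like this get
# pushed down to Presto, or does Spark pull the whole table first?
result = df.filter(F.col("event_date") == "2020-01-01").limit(1000)
result.show()
```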
How is this different from reading the data with Pandas? With Pandas, the data is loaded directly onto my laptop, so I can only load ~1M rows of data; otherwise the loading takes too long. Will a locally installed Spark push queries, limits, and transformations down to the data source? If not, there is no point in this approach.
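For comparison, this is roughly what I do with Pandas today (using PyHive as the DBAPI connector; host, port, and table are placeholders): the whole result set has to come into local memory, which is why I cap it with a LIMIT.

```python
import pandas as pd
from pyhive import presto  # assuming PyHive as the Presto DBAPI driver

# Placeholder coordinator host/port for the cluster.
conn = presto.connect(host="presto-coordinator", port=8080)

# Everything the query returns is materialized on my laptop,
# so I keep it to ~1M rows to stay within memory and time budgets.
df = pd.read_sql(
    "SELECT * FROM hive.default.events LIMIT 1000000",
    conn,
)
```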
What is the speed difference between running a query in Presto (on the cluster) and running it through a local Spark connected to Presto via JDBC?
Thanks!