I have a Spark 3 cluster set up. I have data in SQL Server, around 100 GB in size, and I need to run various queries on it from the Spark cluster. I have connected to SQL Server from Spark via JDBC and run a sample query. Now, instead of having the queries execute on SQL Server, I want to move/copy the data to the Spark cluster and run the queries there (SQL Server is taking too much time, which is why we are using Spark in the first place). There are around 10 tables in the database.
What are the possible ways to achieve this?
If I execute the query directly from Spark against SQL Server, it takes too much time because SQL Server is the bottleneck (it runs on a single machine). Is there a better way to do this?
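For context, this is roughly how I am reading from SQL Server today, a minimal PySpark sketch (the server address, database name, table name, and credentials are placeholders, not my real values):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sqlserver-test").getOrCreate()

# Placeholder connection details -- replace with the real server/credentials.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:sqlserver://myserver:1433;databaseName=mydb")
      .option("dbtable", "dbo.some_table")
      .option("user", "spark_user")
      .option("password", "****")
      .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
      .load())

# Queries against this DataFrame end up being pushed down to / pulled
# through SQL Server, which is where the slowness shows up.
df.createOrReplaceTempView("some_table")
spark.sql("SELECT COUNT(*) FROM some_table").show()
```

I am wondering whether I should instead do a one-time bulk read of each table (e.g. write each DataFrame out to Parquet on the cluster's storage) and run all subsequent queries against those copies.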