I am running a Spark analytics application that reads an entire MSSQL Server table directly using the Spark JDBC data source. The table has more than 30M records but no primary key or integer column. Since the table has no such column, I cannot use the partitionColumn option, so reading the table takes far too long.
val datasource = spark.read.format("jdbc")
  .option("url", "jdbc:sqlserver://host:1433;database=mydb")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  .option("dbtable", "dbo.table")
  .option("user", "myuser")
  .option("password", "password")
  .option("useSSL", "false")
  .load()
Is there any way to improve performance in such a case and read from relational database sources in parallel? (The source could be Oracle, MSSQL Server, MySQL, or DB2.)
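For context, one workaround I have been considering (a sketch only, not tested at this scale) is the overload of spark.read.jdbc that takes an explicit predicates array, where each predicate becomes the WHERE clause of one partition's query. Since there is no numeric column, the predicates below bucket rows with SQL Server's CHECKSUM(*) modulo the partition count; the connection details match the snippet above, and the hash expression would need to change per database (e.g. ORA_HASH on Oracle):

```scala
import java.util.Properties

// Number of parallel JDBC partitions -- tune to cluster cores and what the DB can handle.
val numPartitions = 8

// One predicate per partition; each runs as a separate JDBC query.
// ABS(CHECKSUM(*)) % n is SQL Server-specific; swap in the equivalent
// hash function for Oracle/MySQL/DB2 sources.
val predicates: Array[String] =
  (0 until numPartitions)
    .map(i => s"ABS(CHECKSUM(*)) % $numPartitions = $i")
    .toArray

val props = new Properties()
props.setProperty("user", "myuser")
props.setProperty("password", "password")
props.setProperty("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")

// Intended to run in spark-shell / an app where `spark` is the SparkSession.
// Spark launches one task per predicate, so the table is read with
// numPartitions concurrent connections.
val datasource = spark.read.jdbc(
  "jdbc:sqlserver://host:1433;database=mydb",
  "dbo.table",
  predicates,
  props)
```

But I am not sure whether CHECKSUM(*) over every column is efficient on 30M rows, or whether this is the idiomatic way to parallelise a keyless table, so other suggestions are welcome.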