
I want to write around 10 GB of data every day to an Azure SQL Database using PySpark. I am currently using the JDBC driver, which takes hours because it issues insert statements one by one.
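For context, this is roughly what the current slow path looks like. This is a minimal sketch with hypothetical server/table names; the `batchsize` option of Spark's JDBC data source reduces round trips, but the write is still ordinary inserts under the hood:

```python
# Sketch of a plain JDBC write from PySpark (server, database, and table
# names are placeholders). Assumes an existing SparkSession `spark` and
# DataFrame `df`.

def jdbc_options(server, database, user, password, batch_size=10000):
    """Build the option dict for Spark's JDBC data source writer."""
    return {
        "url": f"jdbc:sqlserver://{server}:1433;databaseName={database}",
        "user": user,
        "password": password,
        # rows per JDBC batch; still row inserts, not bulk copy
        "batchsize": str(batch_size),
    }

# opts = jdbc_options("mysqlserver.database.windows.net", "MyDatabase",
#                     "username", "*********")
# (df.write.format("jdbc").mode("append")
#    .options(**opts).option("dbtable", "dbo.Clients").save())
```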

I am planning to use the azure-sqldb-spark connector, which claims to speed up the write dramatically using bulk insert.

I went through the official doc: https://github.com/Azure/azure-sqldb-spark. The library is written in Scala and basically requires importing two Scala classes:

import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._

val bulkCopyConfig = Config(Map(
  "url"               -> "mysqlserver.database.windows.net",
  "databaseName"      -> "MyDatabase",
  "user"              -> "username",
  "password"          -> "*********",
  "databaseName"      -> "MyDatabase",
  "dbTable"           -> "dbo.Clients",
  "bulkCopyBatchSize" -> "2500",
  "bulkCopyTableLock" -> "true",
  "bulkCopyTimeout"   -> "600"
))

df.bulkCopyToSqlDB(bulkCopyConfig)

Can it be used in PySpark like this (via sc._jvm)?

Config = sc._jvm.com.microsoft.azure.sqldb.spark.config.Config
connect = sc._jvm.com.microsoft.azure.sqldb.spark.connect._

# all config

df.connect.bulkCopyToSqlDB(bulkCopyConfig)

I am not an expert in Python. Can anybody help me with a complete snippet to get this done?

Ajay Kumar
  • What help are you expecting? – Sundeep Pidugu Oct 27 '18 at 10:51
  • How to use the azure-sqldb-spark connector in PySpark? I know it can be done in Scala, but my entire project is in Python. – Ajay Kumar Oct 29 '18 at 07:36
  • I think we don't have any examples yet; please subscribe to this issue: https://github.com/Azure/azure-sqldb-spark/issues/20 – Sundeep Pidugu Oct 30 '18 at 06:26
  • Hey @AjayKumar, how did you overcome the performance issue in PySpark? I am currently running into a performance issue. Can you help me? – Tharunkumar Reddy Sep 18 '19 at 12:46
  • @AjayKumar The project in the GitHub link you referenced is no longer actively maintained. Instead use the project in [this link](https://github.com/microsoft/sql-spark-connector). Microsoft encourages us to use this project, which has Python and R bindings, an easier-to-use interface for bulk inserting data, and many other improvements. – nam Apr 23 '22 at 17:05
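Following up on the last comment: the newer connector exposes a DataFrame writer format usable directly from PySpark, so no Scala interop is needed. This is a hedged sketch with placeholder credentials and option values, based on the sql-spark-connector project's documented `com.microsoft.sqlserver.jdbc.spark` format; check the project README for the options supported by your version:

```python
# Sketch of a bulk write with the newer sql-spark-connector
# (https://github.com/microsoft/sql-spark-connector). Assumes an existing
# SparkSession with the connector JAR on the classpath and a DataFrame `df`.
# Server, database, table, and credential values are placeholders.

def bulk_write_options(server, database, table, user, password):
    """Build the option dict for the connector's DataFrame writer."""
    return {
        "url": f"jdbc:sqlserver://{server};databaseName={database}",
        "dbtable": table,
        "user": user,
        "password": password,
        "tableLock": "true",     # take a table lock for faster bulk copy
        "batchsize": "100000",   # rows per bulk-copy batch
    }

# opts = bulk_write_options("mysqlserver.database.windows.net", "MyDatabase",
#                           "dbo.Clients", "username", "*********")
# (df.write.format("com.microsoft.sqlserver.jdbc.spark")
#    .mode("append").options(**opts).save())
```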

1 Answer


The Spark connector currently (as of March 2019) only supports the Scala API (as documented here). So if you are working in a notebook, you could do all the preprocessing in Python, then register the dataframe as a temp table, e.g.:

df.createOrReplaceTempView('testbulk')

and do the final step in Scala:

%scala
//configs...
spark.table("testbulk").bulkCopyToSqlDB(bulkCopyConfig)
Hauke Mallow