Write PySpark Dataframe to SQL DB as batch

Asked Aug 21 '18 at 15:23

Active Sep 10 '20 at 06:25

Viewed 4,489 times

I have a DataFrame in PySpark (on Databricks) and I want to write it to a SQL database (Azure SQL Database in my case). This works, but it appears to trigger row-by-row inserts into the database, which is not feasible for 10M+ rows. Is there a way to force PySpark to use bulk inserts instead?

Currently I simply use this command:

df.write.jdbc(url=jdbcUrl, table=targetTable, mode="append", properties=connectionProperties)
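One knob that may be worth trying with the command above is the `batchsize` option of Spark's JDBC writer, which controls how many rows go into each JDBC batch. The sketch below is only an illustration: the helper names `build_jdbc_write_options` and `write_in_batches` and the batch size of 10000 are assumptions, not part of the original code. It reduces round trips, but the statements sent are still parameterized INSERTs rather than a true bulk insert.

```python
def build_jdbc_write_options(url, table, connection_properties, batch_size=10000):
    """Merge the connection properties with options for Spark's JDBC writer.

    Hypothetical helper; the default batch size is an arbitrary assumption.
    """
    options = dict(connection_properties)  # e.g. user, password, driver
    options["url"] = url
    options["dbtable"] = table
    # Spark JDBC writer option: number of rows sent per JDBC batch.
    options["batchsize"] = str(batch_size)
    return options


def write_in_batches(df, url, table, connection_properties, batch_size=10000):
    """Append df to the target table using larger JDBC batches."""
    options = build_jdbc_write_options(url, table, connection_properties, batch_size)
    df.write.format("jdbc").options(**options).mode("append").save()
```

With the names from the question, the call would look like `write_in_batches(df, jdbcUrl, targetTable, connectionProperties)`.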

The code that gets executed on the SQL side looks like this:

(@P0 int,@P1 bit,@P2 bit,@P3 float,@P4 float,@P5 nvarchar(4000),@P6 int,@P7 int,@P8 int)INSERT INTO dbo.MyTable("Index","Sampling10pct","Sampling1pct","Latitude","Longitude","SessionID","Year","Month","Day") VALUES (@P0,@P1,@P2,@P3,@P4,@P5,@P6,@P7,@P8)
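The statement above is a fully parameterized INSERT, which is the shape Microsoft's JDBC driver for SQL Server can convert into a bulk copy operation when the connection property `useBulkCopyForBatchInsert` is set to `true`. The sketch below assumes that the driver installed on the cluster is recent enough to honour this property, which should be checked against the driver documentation; the helper name `with_bulk_copy` is hypothetical.

```python
def with_bulk_copy(connection_properties):
    """Return a copy of the properties with the driver's bulk-copy switch enabled."""
    props = dict(connection_properties)  # leave the caller's dict untouched
    # Connection property of Microsoft's JDBC driver for SQL Server.
    # Assumption: the driver version on the cluster supports it.
    props["useBulkCopyForBatchInsert"] = "true"
    return props


# Hypothetical usage, reusing the names from the question:
# df.write.jdbc(url=jdbcUrl, table=targetTable, mode="append",
#               properties=with_bulk_copy(connectionProperties))
```

Spark passes properties it does not recognise as its own options through to the JDBC driver, so no change to the write call itself should be needed beyond the extra property.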

asked Aug 21 '18 at 15:23

Gerhard Brueckl


https://github.com/Azure/azure-sqldb-spark – Alper t. Turker Aug 21 '18 at 17:00
Thanks, that basically answers my question. I just need to use Scala instead of Python, which is OK. – Gerhard Brueckl Aug 22 '18 at 08:27

The API is not very complex, so it shouldn't require much effort to use it from PySpark (https://stackoverflow.com/q/36023860/8371915) – Alper t. Turker Aug 22 '18 at 09:14


0 Answers