
I'm trying to bulk copy a DF with 1.5B rows into a SQL Server table. It's only 5 columns, but the number of rows is massive. Here is the approach I'm using:

import com.microsoft.azure.sqldb.spark.bulkcopy.BulkCopyMetadata
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._

// Connection details and bulk copy options for the Azure SQL DB Spark connector
val bulkCopyConfig = Config(Map(
  "url"               -> "mysqlserver.database.windows.net",
  "databaseName"      -> "MyDatabase",
  "user"              -> "username",
  "password"          -> "*********",
  "dbTable"           -> "dbo.Clients",
  "bulkCopyBatchSize" -> "100000",
  "bulkCopyTableLock" -> "true",
  "bulkCopyTimeout"   -> "600"
))

df.bulkCopyToSqlDB(bulkCopyConfig)

I'm following the example from the link below.

https://learn.microsoft.com/en-us/azure/sql-database/sql-database-spark-connector

It keeps running and running, for hours, but never copies anything from the DF to the table. Thoughts? Suggestions? Thanks.

BTW, if I try to copy a much smaller DF, with around 1.5M rows, it works just fine.

ASH
  • Do the transactions complete, or do you cancel them? Do you have any indexes on the table? – Mr Zach Oct 27 '19 at 20:50
  • Suggest you look at sys.dm_exec_requests while doing the import to see what is happening. Alternatively, you can start an XE or Profiler session and watch the commands completing to get a sense of what is happening. – Conor Cunningham MSFT Oct 27 '19 at 21:21
  • Install `whoisactive` on the destination database and see what's going on. https://github.com/amachanic/sp_whoisactive – mauridb Oct 29 '19 at 00:11
  • What kind of indexes are you using on the target table? Rowstore/Columnstore/Clustered Columnstore/Heap? – marcin2x4 Nov 20 '22 at 00:45

1 Answer


In my experience, your issue is likely caused by the value of the bulkCopyBatchSize parameter. As the official document Managing Bulk Copy Batch Sizes explains below, a larger batch size does not automatically mean better performance.

The number of rows making up a batch can have significant performance effects when bulk copying a large number of rows. The recommendations for batch size depend on the type of bulk copy being performed.

  • When bulk copying to SQL Server, specify the TABLOCK bulk copy hint and set a large batch size.

  • When TABLOCK is not specified, limit batch sizes to less than 1,000 rows.

For reference, see the answer to the SO thread What is the recommended batch size for SqlBulkCopy? and try again with bulkCopyBatchSize set to 5000. I expect this to perform better than your current setting, because a very large batch takes longer to transmit over the network (including any retried packets) before the server can acknowledge it.
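
For example, keeping everything else from the config in your question (the server name, database, and credentials below are just the placeholders from your post), you could lower only the batch size while keeping the table lock:

import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._

// Same settings as in the question, but with a smaller batch size per the guidance above
val bulkCopyConfig = Config(Map(
  "url"               -> "mysqlserver.database.windows.net",
  "databaseName"      -> "MyDatabase",
  "user"              -> "username",
  "password"          -> "*********",
  "dbTable"           -> "dbo.Clients",
  "bulkCopyBatchSize" -> "5000",   // smaller batches, per Managing Bulk Copy Batch Sizes
  "bulkCopyTableLock" -> "true",   // keep TABLOCK since we are bulk copying to SQL Server
  "bulkCopyTimeout"   -> "600"
))

df.bulkCopyToSqlDB(bulkCopyConfig)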

Peter Pan