
I have a Spark Dataframe that I want to push to an SQL table on a remote server. The table has an Id column that is set as an identity column. The Dataframe I want to push also has an Id column, and I want to use those Ids in the SQL table, without removing the identity option for the column.

I write the dataframe like this:

df.write.format("jdbc") \
      .mode(mode) \
      .option("url", jdbc_url) \
      .option("dbtable", table_name) \
      .option("user", jdbc_username) \
      .option("password", jdbc_password) \
      .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
      .save()

But I get the following response:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 41.0 failed 4 times, most recent failure: Lost task 0.3 in stage 41.0 (TID 41, 10.1.0.4, executor 0): java.sql.BatchUpdateException: Cannot insert explicit value for identity column in table 'Table' when IDENTITY_INSERT is set to OFF.

I have tried adding a query to the write, like:

query = f"SET IDENTITY_INSERT Table ON;"
df.write.format("jdbc") \
      .mode(mode) \
      .option("url", jdbc_url) \
      .option("query", query) \
      .option("dbtable", table_name) \
      .option("user", jdbc_username) \
      .option("password", jdbc_password) \
      .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
      .save()

But that fails, since the two options are mutually exclusive:

IllegalArgumentException: Both 'dbtable' and 'query' can not be specified at the same time.

Or if I try to run a read with the query first:

com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near the keyword 'SET'.

This is presumably because Spark wraps the query option in a subquery when reading, so only SELECT statements are valid there.

Is this possible to do in Spark, or would I need to use a different connector and combine setting IDENTITY_INSERT ON with regular INSERT INTO statements?

I would prefer a solution that allowed me to keep writing through the Spark context. But I am open to other solutions.

Cralle
  • related https://stackoverflow.com/questions/1334012/cannot-insert-explicit-value-for-identity-column-in-table-table-when-identity. If you are using SQL Server, maybe you could try to set IDENTITY_INSERT ON – abiratsis Jun 29 '21 at 09:39
  • I have tried but as far as I can tell IDENTITY_INSERT ON only applies for the session. If I run the query through SSMS and then run my write command in spark I get the same error. – Cralle Jul 01 '21 at 07:11
  • I think you should execute the command `SET IDENTITY_INSERT tableName ON` on SQL Server itself not in your Spark code – abiratsis Jul 01 '21 at 13:24
  • That also does not work, I already try this and I get the same error when running my Spark code. As I mentioned in my previous comment, I ran the query in SSMS (SQL Server Management Studio), so on the SQL Server itself. – Cralle Jul 08 '21 at 08:46
  • @Cralle did you find any solution for this issue, I am facing similar problem while trying to insert into a timestamp with default value. I set the identity as ON in SSMS but stil keep getting the error. – NMAK Aug 19 '21 at 13:34
  • @NMAK No solution yet sadly. This issue is still relevant. – Cralle Aug 24 '21 at 07:29

1 Answer


One way to work around this issue is the following:

  • Save your dataframe as a temporary table in your database.
  • Set identity insert to ON.
  • Insert into your real table the content of your temporary table.
  • Set identity insert to OFF.
  • Drop your temporary table.

Here's a pseudo code example:

tablename = "MyTable"
tmp_tablename = tablename + "_tmp"

# 1. Write the dataframe to a temporary staging table
df.write.format("jdbc").options(..., dbtable=tmp_tablename).save()

# 2-5. Copy into the real table with IDENTITY_INSERT ON, then clean up
columns = ','.join(df.columns)
query = f"""
  SET IDENTITY_INSERT {tablename} ON;
  INSERT INTO {tablename} ({columns})
  SELECT {columns} FROM {tmp_tablename};
  SET IDENTITY_INSERT {tablename} OFF;
  DROP TABLE {tmp_tablename};
"""
execute(query)  # e.g. with a Cursor from pyodbc, which runs the whole batch in one session
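If it helps, the query-building part can be sketched as plain Python. The helper name and the sample column list below are made up for illustration; executing the batch (e.g. through a pyodbc connection with autocommit) is left as a comment, since IDENTITY_INSERT is a per-session setting and the whole batch must run on a single connection:

```python
def build_identity_insert_batch(tablename, tmp_tablename, columns):
    """Build the T-SQL batch that copies a staging table into the target
    table with IDENTITY_INSERT ON, then drops the staging table.

    Note: table and column names are interpolated directly, so they must
    be trusted values, not user input.
    """
    cols = ", ".join(columns)
    return "\n".join([
        f"SET IDENTITY_INSERT {tablename} ON;",
        f"INSERT INTO {tablename} ({cols})",
        f"SELECT {cols} FROM {tmp_tablename};",
        f"SET IDENTITY_INSERT {tablename} OFF;",
        f"DROP TABLE {tmp_tablename};",
    ])

query = build_identity_insert_batch("MyTable", "MyTable_tmp", ["Id", "Name"])
# Run `query` on one connection, e.g.:
#   with pyodbc.connect(connection_string, autocommit=True) as conn:
#       conn.execute(query)
print(query.splitlines()[0])  # → SET IDENTITY_INSERT MyTable ON;
```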
MrRedbloX