
I have a use case where I want to update specific rows, identified by any identifier/WHERE-clause condition, and push those updates back to Oracle or SQL Server from Databricks.

When I use spark.read.format("jdbc") against either of these databases, I could not find an easy way to update specific rows back in them.

If I use

df.write.format("jdbc")
.option("url", dbServerJdbcUrl)
.option("user", username)
.option("secret", password)
.option("driver", <either com.microsoft.sqlserver.jdbc.SQLServerDriver or oracle.jdbc.driver.OracleDriver>)
.option("dbTable",<table on the database platform>)
.mode('overwrite') //or other options
.save()

it only overwrites the whole "dbtable" on the database. So far I have not been able to get it to work with .option("query", "<update statement>") either.

If I write to another temp or parking table instead, it becomes two stages of work: I have to go back to the database platform and update the actual tables from the parking table.
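For reference, that two-stage flow would look roughly like the sketch below. The staging table name, key column, and MERGE statement are only illustrative, and the second stage has to run on the database side; here it is issued through the JVM DriverManager that the Databricks driver node exposes, assuming the SQL Server JDBC driver is installed on the cluster.

# Stage 1: land only the changed rows in a parking/staging table (name is illustrative)
df.write.format("jdbc") \
  .option("url", dbServerJdbcUrl) \
  .option("user", username) \
  .option("password", password) \
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
  .option("dbtable", "dbo.target_table_staging") \
  .mode("overwrite") \
  .save()

# Stage 2: merge the staged rows into the real table on the database itself
conn = spark.sparkContext._gateway.jvm.java.sql.DriverManager.getConnection(
    dbServerJdbcUrl, username, password)
stmt = conn.createStatement()
stmt.executeUpdate("""
    MERGE INTO dbo.target_table AS t
    USING dbo.target_table_staging AS s
      ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET t.some_col = s.some_col;
""")
stmt.close()
conn.close()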

Another note - when I do the plain df.write shown earlier against a table that has millions of rows, and I only want to update a handful of them, every available mode just causes more trouble.

  • overwrite - simply causes the millions of existing rows to be lost/overwritten by the handful of rows in the df.
  • append - either creates duplicates or eventually fails due to constraint violations.

Is there a better solution for having Databricks update specific rows in a database?

Ak777
  • There is not yet an API for DataFrameWriter to do this kind of job (refer to [this](https://stackoverflow.com/a/35640373/11289386)), but you could still loop through the records in the dataframe and build an update statement one by one (which I think is not a very good option). – Phuri Chalermkiatsakul May 23 '22 at 03:20
  • Thanks for your input, but even if I loop through, how would spark.write() work with a query without replacing my whole table? – Ak777 May 23 '22 at 04:24
  • What I am thinking is that you could loop through the DF to get the values and then update the table via a normal SQL script with cx_Oracle. I found that this should be possible from this [blog](https://medium.com/@srijansahay/connecting-sql-server-oracle-mysql-and-postgresql-from-azure-services-using-python-789e93d879b4). – Phuri Chalermkiatsakul May 23 '22 at 04:37
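For completeness, a minimal sketch of the row-by-row idea from the comments above: collect the (small) dataframe to the driver and issue parameterised UPDATE statements over an ordinary database connection. It assumes a SQL Server target, the pyodbc package installed on the cluster, and illustrative table/column names and connection placeholders (for Oracle, the same pattern works with cx_Oracle).

import pyodbc  # assumes pyodbc and the ODBC driver are installed on the cluster

# Collect only the handful of rows that need updating (fine because df is small)
rows = df.select("id", "status").collect()

# db_host and db_name are placeholders for the target server and database
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    f"SERVER={db_host};DATABASE={db_name};UID={username};PWD={password}"
)
cursor = conn.cursor()

# One parameterised UPDATE per row, batched with executemany
cursor.executemany(
    "UPDATE dbo.target_table SET status = ? WHERE id = ?",
    [(r["status"], r["id"]) for r in rows],
)
conn.commit()
cursor.close()
conn.close()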

1 Answer


The code snippet you provided is written in Python and uses PySpark to write data to a database over JDBC. However, the snippet is incomplete and missing some necessary information. Here's an example of how you can complete it:

df.write.format("jdbc") \
  .option("url", dbServerJdbcUrl) \
  .option("user", username) \
  .option("password", password) \
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \ # or "oracle.jdbc.driver.OracleDriver" for Oracle
  .option("dbtable", "your_table_name") \
  .mode('overwrite') \
  .save()
  • No. This isn't the answer I am looking for, and it is the same as what I've mentioned in my question, just with the driver configured explicitly. Have you tried this against an existing DB table that has millions of rows while your df has just 1000? I am expecting updates to those 1000 rows, not replacing millions with 1000. – Ak777 Jun 25 '23 at 07:26