
I have a DataFrame with a few rows, some of which already exist in the DB. I want to update a few columns of those existing rows. How can we do that?

I see we have the SaveModes append and overwrite, which might serve the purpose, but there is a limitation in both cases:

  1. With append, I get a primary key error, as this option tries to create a new row in the DB.
  2. With overwrite, I will lose the values of the unchanged attributes in the tuple.

Can someone please suggest how I can update a few attributes (column values) of a row (tuple)?
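
For reference, this is roughly the write I am attempting; the connection details and table name below are placeholders, and the MySQL JDBC driver is assumed to be on the classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-append-sketch").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

(df.write
   .format("jdbc")
   .option("url", "jdbc:mysql://dbhost:3306/mydb")
   .option("dbtable", "my_table")
   .option("user", "user")
   .option("password", "password")
   .mode("append")  # raises a duplicate-key error if these ids already exist
   .save())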

Ritika Garg

2 Answers


In Spark, DataFrames are immutable, so you cannot change a value in place. One way is to read the complete table, make the modifications, and write the complete table back in overwrite mode, but this takes time.

If your modifications are always scoped to a particular group, say by user id or by date, you can write the data partitioned on that column using partitionBy(). You can then read just that partition with .filter(), make the modifications, and overwrite only that partition using insertInto() (available from PySpark 2.3.0). Refer to this answer for other PySpark versions: Overwrite specific partitions in spark dataframe write method
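
A minimal sketch of that approach, assuming Spark >= 2.3.0; the table name events, the partition column event_date, and the status change are hypothetical:

from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("partition-update-sketch")
         # overwrite only the partitions present in the incoming data
         .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
         .getOrCreate())

# Initial write: partition the table by the grouping column.
df = spark.createDataFrame(
    [(1, "new", "2020-01-01"), (2, "new", "2020-01-02")],
    ["id", "status", "event_date"])
df.write.partitionBy("event_date").saveAsTable("events")

# Read back only the partition that needs changes.
part = spark.table("events").filter(F.col("event_date") == "2020-01-01")

# Apply the modification to the rows of that partition.
updated = part.withColumn("status", F.lit("processed"))

# Overwrite just that partition; the 2020-01-02 partition is untouched.
updated.write.mode("overwrite").insertInto("events")

Note that insertInto() matches columns by position, not by name, so the DataFrame's column order must match the table's, with the partition column last.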

Raghu

This can be handled at the MySQL level; the concept is known as an upsert.

Case when the primary key is new: the SQL insert will go into the MySQL DB as a new row.

Case when the primary key already exists: you can use INSERT ... ON DUPLICATE KEY UPDATE, for example (hypothetical table and column names, with id as the primary key):

INSERT INTO target_table (id, col1, col2)
VALUES (1, 'a', 'b')
ON DUPLICATE KEY UPDATE col1 = VALUES(col1), col2 = VALUES(col2);

This updates the existing row with the new values instead of raising a duplicate-key error.

Read more here and here.

The ideal way to handle such a use case is to first insert your data into a temporary (staging) table in your MySQL DB, and then load that data into the original table with a trigger or a stored procedure invoked from Spark itself.
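
A minimal sketch of that flow, with hypothetical table names and connection details; it appends the DataFrame to a staging table over JDBC, then (in place of a trigger, which would fire automatically on insert) runs the upsert explicitly from the driver using the mysql-connector-python package:

from pyspark.sql import SparkSession
import mysql.connector

spark = SparkSession.builder.appName("upsert-sketch").getOrCreate()
df = spark.createDataFrame([(1, "a", "b")], ["id", "col1", "col2"])

# 1. Append the new/changed rows into a staging table via JDBC.
(df.write
   .format("jdbc")
   .option("url", "jdbc:mysql://dbhost:3306/mydb")
   .option("dbtable", "staging_table")
   .option("user", "user")
   .option("password", "password")
   .mode("append")
   .save())

# 2. Move the staged rows into the target table, updating only the
#    changed columns where the primary key already exists.
conn = mysql.connector.connect(host="dbhost", database="mydb",
                               user="user", password="password")
cur = conn.cursor()
cur.execute("""
    INSERT INTO target_table (id, col1, col2)
    SELECT id, col1, col2 FROM staging_table
    ON DUPLICATE KEY UPDATE col1 = VALUES(col1), col2 = VALUES(col2)
""")
cur.execute("TRUNCATE TABLE staging_table")  # clear the staging table for the next batch
conn.commit()
cur.close()
conn.close()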

dsk