
I have a DataFrame with a few rows, some of which already exist in the DB. I want to update a few columns of those existing rows. How can we do that?

I see we have the SaveModes append and overwrite, which might serve the purpose, but there is a limitation in both cases:

  1. With append, I get a primary key error, as this option tries to create a new row in the DB.
  2. With overwrite, I will lose the values of the unchanged attributes in the tuple.

Can someone please suggest how I can update a few attributes (column values) of a row (tuple)?
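
For reference, this is roughly the write I am attempting; the connection details and table name below are placeholders, and the MySQL JDBC driver is assumed to be on the classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-append-sketch").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

(df.write
   .format("jdbc")
   .option("url", "jdbc:mysql://dbhost:3306/mydb")
   .option("dbtable", "my_table")
   .option("user", "user")
   .option("password", "password")
   .mode("append")  # raises a duplicate-key error if these ids already exist
   .save())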

Ritika Garg

2 Answers


In Spark, DataFrames are immutable, so you cannot change a value in place. One way is to read the complete table, make the modifications, and write the complete table back in overwrite mode, but this takes time.

If your modifications are always scoped to a particular group, say by user id or by date, you can write the data partitioned on that column using partitionBy(). You can then read just that partition with .filter(), make the modifications, and overwrite only that partition using insertInto() (available from PySpark 2.3.0). Refer to this answer for other PySpark versions: Overwrite specific partitions in spark dataframe write method
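
A minimal sketch of that approach, assuming Spark >= 2.3.0; the table name events, the partition column event_date, and the status change are hypothetical:

from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("partition-update-sketch")
         # overwrite only the partitions present in the incoming data
         .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
         .getOrCreate())

# Initial write: partition the table by the grouping column.
df = spark.createDataFrame(
    [(1, "new", "2020-01-01"), (2, "new", "2020-01-02")],
    ["id", "status", "event_date"])
df.write.partitionBy("event_date").saveAsTable("events")

# Read back only the partition that needs changes.
part = spark.table("events").filter(F.col("event_date") == "2020-01-01")

# Apply the modification to the rows of that partition.
updated = part.withColumn("status", F.lit("processed"))

# Overwrite just that partition; the 2020-01-02 partition is untouched.
updated.write.mode("overwrite").insertInto("events")

Note that insertInto() matches columns by position, not by name, so the DataFrame's column order must match the table's, with the partition column last.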

Raghu

This can be handled at the MySQL level; the concept is known as an upsert.

Case when the primary key is new: the SQL insert will go into the MySQL DB as a new row.

Case when the primary key already exists: you can use INSERT ... ON DUPLICATE KEY UPDATE, for example (hypothetical table and column names, with id as the primary key):

INSERT INTO target_table (id, col1, col2)
VALUES (1, 'a', 'b')
ON DUPLICATE KEY UPDATE col1 = VALUES(col1), col2 = VALUES(col2);

This updates the existing row with the new values instead of raising a duplicate-key error.

Read more here and here.

The ideal way to handle such a use case is to first insert your data into a temporary (staging) table in your MySQL DB, and then load that data into the original table with a trigger or a stored procedure invoked from Spark itself.
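
A minimal sketch of that flow, with hypothetical table names and connection details; it appends the DataFrame to a staging table over JDBC, then (in place of a trigger, which would fire automatically on insert) runs the upsert explicitly from the driver using the mysql-connector-python package:

from pyspark.sql import SparkSession
import mysql.connector

spark = SparkSession.builder.appName("upsert-sketch").getOrCreate()
df = spark.createDataFrame([(1, "a", "b")], ["id", "col1", "col2"])

# 1. Append the new/changed rows into a staging table via JDBC.
(df.write
   .format("jdbc")
   .option("url", "jdbc:mysql://dbhost:3306/mydb")
   .option("dbtable", "staging_table")
   .option("user", "user")
   .option("password", "password")
   .mode("append")
   .save())

# 2. Move the staged rows into the target table, updating only the
#    changed columns where the primary key already exists.
conn = mysql.connector.connect(host="dbhost", database="mydb",
                               user="user", password="password")
cur = conn.cursor()
cur.execute("""
    INSERT INTO target_table (id, col1, col2)
    SELECT id, col1, col2 FROM staging_table
    ON DUPLICATE KEY UPDATE col1 = VALUES(col1), col2 = VALUES(col2)
""")
cur.execute("TRUNCATE TABLE staging_table")  # clear the staging table for the next batch
conn.commit()
cur.close()
conn.close()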

dsk