I am reading data from S3 with Spark Streaming and want to write the streamed data into Amazon Redshift. If a row with the same primary key already exists, it should be updated; otherwise a new row should be inserted. Can someone suggest the right approach, keeping performance in mind?
import org.apache.spark.streaming.{Duration, StreamingContext}

val ssc = new StreamingContext(sc, Duration(30000))

// Pick up new files arriving under the hourly S3 prefix
val lines = ssc.textFileStream("s3://<path-to-data>/YYYY/MM/DD/HH")

lines.foreachRDD { rdd =>
  val normalizedRDD = processRDD(rdd)
  val df = spark.createDataset(normalizedRDD)
  // TODO: How to update/upsert this micro-batch into Redshift?
}
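One approach I am considering is the staging-table merge pattern via the spark-redshift connector (com.databricks.spark.redshift): load each micro-batch into a staging table, then delete matching keys from the target and insert the batch. Below is a rough sketch of what I mean; the table names events / events_staging, the key column id, and the connection details are just placeholders, and I am not sure this is the best option performance-wise.

df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://<host>:5439/<db>?user=<user>&password=<password>")
  .option("dbtable", "events_staging")               // load the micro-batch into a staging table
  .option("tempdir", "s3://<temp-bucket>/redshift-staging/")  // S3 scratch space used by COPY
  .option("forward_spark_s3_credentials", "true")
  // After the COPY succeeds, merge staging into the target:
  // delete rows whose key already exists, then insert the whole batch.
  .option("postactions",
    "DELETE FROM events USING events_staging WHERE events.id = events_staging.id; " +
    "INSERT INTO events SELECT * FROM events_staging;")
  .mode("overwrite")
  .save()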