2

I am saving tables from Spark SQL using MySQL as my storage engine. My table looks like

+-------------+----------+
|        count|      date|
+-------------+----------+
|           72|2017-09-08|
|           84|2017-09-08|
+-------------+----------+

I want to UPDATE the table by adding the count using GROUP BY and dropping the individual rows. So my output should be like

 +-------------+----------+
 |        count|      date|
 +-------------+----------+
 |          156|2017-09-08|
 +-------------+----------+

Is it a right expectation and if possible, how it could be achieved using Spark SQL ?

jophab
  • 5,356
  • 14
  • 41
  • 60
jdk2588
  • 782
  • 1
  • 9
  • 23

2 Answers2

3

Before you write the table to MYSQL, apply the following logic in your spark dataframe/dataset

import org.apache.spark.sql.functions._
df.groupBy("date").agg(sum("count").as("count"))

And write the transformed dataframe to MYSQL.

Ramesh Maharjan
  • 41,071
  • 6
  • 69
  • 97
  • This makes sense when have to add new rows to a table, but i want to group by existing rows. – jdk2588 Sep 10 '17 at 10:52
  • @jdk2588, it is suitable for both the cases you are talking about. You can read the existing rows to spark and apply the logic i explained and delete the table in mysql and write a fresh table. :) thanks for the upvote :) – Ramesh Maharjan Sep 10 '17 at 14:21
  • deleting a table and applying update is an expensive operation, because if there are only some entries which needs to be overwritten, then the whole table has to be dropped (even for entries which are not required to be updated) – jdk2588 Sep 11 '17 at 03:59
  • In my opinion, update of table and dropping rows would be expensive than deleting and writing new table. – Ramesh Maharjan Sep 11 '17 at 04:52
2

Soln 1

In MySQL, you can make use of TEMPORARY TABLE to store the results after grouping.

Then truncate the original table. Now insert data from temporary table to original table.

CREATE TEMPORARY TABLE temp_table

AS

(SELECT SUM(count) as count, [date] from table_name GROUP BY [date]);

TRUNCATE TABLE table_name;

INSERT INTO table_name (count,[date])

SELECT (count,[date]) from temp_table;

DROP TEMPORARY TABLE temp_table;

Soln 2

Update the rows using following query.

UPDATE table_name t 
INNER JOIN 
(SELECT sum(count) as [count], [date] FROM table_name GROUP BY [date]) t1 
ON t.[date] = t1.[date]
SET t.[count] = t1.[count]

Assuming that the table has a unique column named uid,

DELETE t1 FROM table_name t1, table_name t2 
WHERE t1.uid > t2.uid AND t1.[date] = t2.[date]

Please refer this SO question to see more about deleting duplicate rows.

jophab
  • 5,356
  • 14
  • 41
  • 60