Update table by doing GROUP BY on existing rows SPARK - SQL?

Question

I am saving tables from Spark SQL using MySQL as my storage engine. My table looks like

+-------------+----------+
|        count|      date|
+-------------+----------+
|           72|2017-09-08|
|           84|2017-09-08|
+-------------+----------+

I want to UPDATE the table by adding the count using GROUP BY and dropping the individual rows. So my output should be like

 +-------------+----------+
 |        count|      date|
 +-------------+----------+
 |          156|2017-09-08|
 +-------------+----------+

Is it a right expectation and if possible, how it could be achieved using Spark SQL ?

score 3 · Answer 1 · answered Sep 10 '17 at 09:00

3

Before you write the table to MYSQL, apply the following logic in your spark dataframe/dataset

import org.apache.spark.sql.functions._
df.groupBy("date").agg(sum("count").as("count"))

And write the transformed dataframe to MYSQL.

answered Sep 10 '17 at 09:00

Ramesh Maharjan

41,071
6
69
97

This makes sense when have to add new rows to a table, but i want to group by existing rows. – jdk2588 Sep 10 '17 at 10:52
@jdk2588, it is suitable for both the cases you are talking about. You can read the existing rows to spark and apply the logic i explained and delete the table in mysql and write a fresh table. :) thanks for the upvote :) – Ramesh Maharjan Sep 10 '17 at 14:21
deleting a table and applying update is an expensive operation, because if there are only some entries which needs to be overwritten, then the whole table has to be dropped (even for entries which are not required to be updated) – jdk2588 Sep 11 '17 at 03:59
In my opinion, update of table and dropping rows would be expensive than deleting and writing new table. – Ramesh Maharjan Sep 11 '17 at 04:52

jophab · Answer 2 · 2017-09-10T08:24:48.767

2

Soln 1

In MySQL, you can make use of TEMPORARY TABLE to store the results after grouping.

Then truncate the original table. Now insert data from temporary table to original table.

CREATE TEMPORARY TABLE temp_table

AS

(SELECT SUM(count) as count, [date] from table_name GROUP BY [date]);

TRUNCATE TABLE table_name;

INSERT INTO table_name (count,[date])

SELECT (count,[date]) from temp_table;

DROP TEMPORARY TABLE temp_table;

Soln 2

Update the rows using following query.

UPDATE table_name t 
INNER JOIN 
(SELECT sum(count) as [count], [date] FROM table_name GROUP BY [date]) t1 
ON t.[date] = t1.[date]
SET t.[count] = t1.[count]

Assuming that the table has a unique column named uid,

DELETE t1 FROM table_name t1, table_name t2 
WHERE t1.uid > t2.uid AND t1.[date] = t2.[date]

Please refer this SO question to see more about deleting duplicate rows.

edited Sep 10 '17 at 08:24

answered Sep 10 '17 at 05:43

jophab

5,356
14
41
60

The records will be increasing so i cannot drop the table and create new table, it will be an expensive operation. – jdk2588 Sep 10 '17 at 06:01
I think it would be faster than first updating and then deleting. Does your table have unique column? – jophab Sep 10 '17 at 08:12
This is an aggregated view, which does not have a unique column – jdk2588 Sep 10 '17 at 11:07

Update table by doing GROUP BY on existing rows SPARK - SQL?

2 Answers2