0

I have access to a reporting dataset (that I don't control) that we retrieve daily from a cloud service and store in a mysql db to run advanced reporting and report combining locally with 3rd party data visualization software.

The data often has duplicate values on an id field that create problems when joining with other tables for data analysis.

For example:

 +-------------+----------+------------+----------+
 | workfile_id | zip_code |    date    |   total  |
 +-------------+----------+------------+----------+
 |       78002 |    90210 | 2016-11-11 | 2010.023 |
 |       78002 |    90210 | 2016-12-22 |  427.132 |
 +-------------+----------+------------+----------+

Workfile_id is duplicated because this is the same job, but additional work on the job was performed in a different month than the original work. Instead of the software creating another workfile id for the job, the same is used.

Doing joins with other tables on workfile_id is problematic when more than one of the same id is present, so I was wondering if it is possible to do one of two things:

  1. Make duplicate workfile_id's unique. Have sql append a number to the workfile id when a duplicate is found. The first duplicate (or second occurrence of the same workfile id) would need to get a .01 appended to the end of the workfile id. Then later, if another duplicate is inserted, it would need to auto increment the appended number, say .02, and so on with any subsequent duplicate workfile_id. This method would work best with our data but I'm curious how difficult this would be for the server from a performance perspective. If I could schedule the alteration to take place after the data is inserted to speed up the initial data insert, that would be ideal.
  2. Sum total columns and remove duplicate workfile_id row. Have a task that identifies duplicate workfile_ids and sums the financial columns of the duplicates, replacing the original total with new sum and deleting the 'new row' after the columns have been added together. This is more messy from a data preservation perspective, but is acceptable if the first solution isn't possible.

My assumption is that there will be significant overhead to have the server compare new workfile_id values to all existing worlfile_id values each time data is inserted, but our dataset is small and new data is only inserted once daily, at 1:30am, and it also should be feasible to keep the duplicate workfile_id searching to rows inserted within the last 6 mo.

Is finding duplicates in a column (workfile_id) and appending an auto-incrementing value onto the workfile_id possible?

EDIT: I'm having trouble getting my trigger to work based on sdsc81's answer below. Any ideas?

DELIMITER //

CREATE TRIGGER append_subID_to_workfile_ID_salesjournal
AFTER INSERT
   ON salesjournal FOR EACH ROW

BEGIN
   SET @COUNTER = ( SELECT (COUNT(*)-1) FROM salesjournal WHERE workfile_id = NEW.workfile_id );
   IF @COUNTER > 1 THEN
       UPDATE salesjournal SET workfile_id = CONCAT(workfile_id, @COUNTER) WHERE id = NEW.id;
   END IF;

END;//

DELIMITER ;

It's hard to know if the trigger isn't working at all, or if just the code in the trigger isn't working. I get no errors on insert. Is there any way to debug trigger errors?

tephlon
  • 1
  • 2

1 Answers1

0

Well, everything is posible ;)

You dont control the dataset but you can modifify the database, right? Then you could use a trigger after every insert of a new value, and update it, if its duplicate. Something like:

SET @COUNTER = ( SELECT (COUNT(*)-1) FROM *your_table* WHERE workfile_id = NEW.workfile_id );
IF @COUNTER > 1 THEN
    UPDATE *your_table* SET workfile_id = CONCAT(workfile_id, @COUNTER) WHERE some_unique_id = NEW.some_unique_id;
END IF;

If there are only one insert a day, and there is defined an index over the workfile_id value, then it shouldn't be any problem for your server at all.

Also, you could implement the second solution, doing:

DELIMITER //

CREATE TRIGGER append_subID_to_workfile_ID_salesjournal
AFTER INSERT ON salesjournal FOR EACH ROW

BEGIN
SET @COUNTER = ( SELECT (COUNT(*)-1) FROM salesjournal WHERE workfile_id = NEW.workfile_id );
IF @COUNTER > 1 THEN
    UPDATE salesjournal SET total = total + NEW.total WHERE workfile_id = NEW.workfile_id AND id <> NEW.id;
    DELETE FROM salesjournal WHERE id = NEW.id;
END IF;

END;//

DELIMITER ;

Hope this helps.

sdsc81
  • 570
  • 8
  • 18
  • If you need the value to added to be of 2 digits, then you could use something like `SET @COUNTER = CONCAT( '0', @COUNTER );` and then `SET @COUNTER = SUBSTR( @COUNTER, LENGTH( @COUNTER - 1));`, before the `UPDATE your_table ...` instruction. It's messy, but should do the trick. Cheers – sdsc81 Dec 29 '16 at 18:30
  • Yes, I'm sorry if I wasn't clear. I control entirely the destination database. I'm not a db admin by trade so I was unclear how something like this would be accomplished. I'm swamped today but I will test your method ASAP and choose it as correct if it works out. Thanks for the quick response. – tephlon Dec 30 '16 at 15:15
  • Hi! I was double checking and see that i corrected the query. By the way, are that all your fields in the table? or do you have an unike *id* that identifies each row? (I mean, like an autoincrement value).. Because that would be needed to work properly. Let me know. – sdsc81 Dec 30 '16 at 18:08
  • Sorry, no, those aren't all the fields. I was just trying to keep it simple. I do have a unique id field that is auto incrementing. I've updated my trigger with your updated code and I'm testing it now. Fingers crossed. – tephlon Dec 30 '16 at 18:39
  • Great! Then you have to set that **ID** in the **where** clause; `WHERE unique_id = NEW.unique_id`. Otherwise it will update all the fields containing the *workfile_id* value. – sdsc81 Dec 30 '16 at 18:58
  • I've updated my trigger on my db (and in the code block of my "EDIT" in my original question), but it still doesn't seem to be working. Is there something special I have to do to make this work in a trigger that I'm missing? – tephlon Dec 30 '16 at 21:00
  • Should be working for new inserts. You could check if each indivual querys work properly. I. E. the select count and the update.. Does the querys work ok? – sdsc81 Dec 31 '16 at 01:26
  • Sorry it took so long to reply-- I've been distracted by other projects. After attempting an insert elsewhere I was able to get it to throw an error, and the error and issue is described here:http://stackoverflow.com/questions/15300673/mysql-error-cant-update-table-in-stored-function-trigger-because-it-is-already Apparently, you can't trigger an update after an update on the same table because of concern of recursion. But auto-increment apparently doesn't work on 'before' triggers. I'm going to test the more complex example offered in the link above and I'll let you know how it goes. – tephlon Jan 24 '17 at 16:35
  • Hi, have you changed the trigger? Because you can't update on `AFTER UPDATE`, but you should be able to update on `AFTER INSERT`. – sdsc81 Jan 25 '17 at 18:02
  • I have updated the answer, i'm thinking this would be the right one. – sdsc81 Jan 26 '17 at 19:46
  • I'm still using an 'after insert' trigger, but it still throws: "Error Code: 1442. Can't update table 'salesjournal' in stored function/trigger because it is already used by statement which invoked this stored function/trigger." It still appears I can't have an UPDATE statement in a TRIGGER if that TRIGGER is applied to the same table as the UPDATE statement in the TRIGGER is updating. Wow, what a weird sentence. Does that make any sense at all? I'm still trying to implement other examples given in other questions but not having any luck. – tephlon Jun 23 '17 at 20:38