1

I have a scheduling database, and I would like to find a way to select all future appointments that are similar because we have a lot of clients who double book themselves. I have been looking but I cant find a way to do quite what I want.

For example if I have the following rows I would like to be able to select those with the same language, a time within 15 minutes of each other, and with names that share > 70% of the same characters.

   |Rec_id|Date_time       |Language|App_name |
   |1     |2014-03-15 12:40|Spanish |Ricardo  |
   |2     |2014-03-15 12:45|Spanish |Ricerdu  |
   |3     |2014-03-16 12:45|Tongan  |Tuaffu   |
   |4     |2014-03-17 12:45|Korean  |Kim      |
   |5     |2014-03-18 12:45|German  |Biternof |
   |6     |2014-03-18 12:32|German  |Biterknof|

From the above data the records I would need are 1,2,5, and 6.

June Lewis
  • 355
  • 1
  • 6
  • 28
  • they are not duplicate , they are different app_name ? – echo_Me Feb 25 '14 at 20:52
  • They are not duplicate, but our clients dont want to pay when we schedule 2 staff to be available. This happens on a much much larger scale several times a day. – June Lewis Feb 25 '14 at 21:00
  • 1
    I would also love to know the reason for the down votes. Is there a way I could improve this question? – June Lewis Feb 25 '14 at 21:03
  • No way I could think of, and I don't understand the downvotes either. There I up you a bit, sorry about those bad votes. – Loïc Feb 28 '14 at 16:26

2 Answers2

1

First thing I thought of was Levenshtein but since MySQL has no native support for it - things get little more complex.

This solution isn't optimized or the best solution at all, but I should do the work.

  • I would create a new column, let's call it "Duplicate_for", DEFAULT NULL.
  • I would create a trigger: TRIGGER BEFORE INSERT for this table.
  • I would create a function to calculate the Levenshtein distance for two strings.
  • I would combine TRIGGER and Levenshtein with following query.

Trigger + query itself:

DELIMITER //
CREATE TRIGGER `booking_before_insert` BEFORE INSERT ON `booking` FOR EACH ROW BEGIN
    DECLARE existingId INT(10) DEFAULT NULL;

    SELECT 
        MAX(id) 
    INTO 
        existingId
    FROM 
        booking 
    WHERE   
        booking.dirty_id IS NULL AND
        booking.lang = NEW.lang AND
        booking.created >= DATE_SUB(NOW(), INTERVAL 15 MINUTE) AND 
        (LEVENSHTEIN(booking.name, NEW.name) / LENGTH(booking.name)) < 0.3;

    SET NEW.dirty_id = existingId;
END//
DELIMITER ;

You can read more about Levenshtein from:

Now you can detect the duplicates with Duplicate_for.

Community
  • 1
  • 1
Niko Hujanen
  • 743
  • 5
  • 10
  • Thank you for the suggestion, I have the Levenshtien set up, but I am having some problems with the trigger. How do I declare the existingID in a trigger? – June Lewis Feb 28 '14 at 14:01
  • Added the code for trigger. `booking` is the name of table. – Niko Hujanen Feb 28 '14 at 14:52
  • Thank you @niko-hujanen, for some reason the Levenshtien does not seem to be working though. I am still playing with it. I have verified the stored procedure works, but I seem to get a duplicate flag regardless of what the new name is. – June Lewis Feb 28 '14 at 17:36
  • I think I figured it out. I need to dencrypt the name. – June Lewis Feb 28 '14 at 17:41
0

That's a nice question.. Kept me awake for a long time at Night.. :D :D First of all, you don't need any triggers and/or extra rows for solving this.

This is what I have attempted so far.

SELECT
t3.*
FROM
Table1 t1
INNER JOIN
Table1 t2
ON
(
  t2.Rec_id > t1.Rec_id
  AND t2.Language = t1.Language
  AND ABS( TIMESTAMPDIFF(MINUTE, t1.Date_time, t2.Date_time)) <= 15
  #AND (
  #  SOUNDEX( t1.App_name ) LIKE CONCAT(TRIM(TRAILING '0' FROM SOUNDEX( t2.App_name )), '%') 
  #  OR
  #  SOUNDEX( t2.App_name ) LIKE CONCAT(TRIM(TRAILING '0' FROM SOUNDEX( t1.App_name )), '%')
  #)
)
INNER JOIN
Table1 t3
ON( t1.Rec_id = t3.Rec_id OR t2.Rec_id = t3.Rec_id )
GROUP BY t3.Rec_id

I have commented out the SOUNDEX part.. 70% match is something fuzzy, isn't it.. Try uncommenting the SOUNDEX part of the query to check if it solves the larger problem.

Manu
  • 901
  • 1
  • 8
  • 28