0

Is this even possible? I can't seem to find any proper guide to set it up. Everything I find is given with instruction on SISS with which I am not familiar at all. Other options I find are involving SOUNDEX() which is not relevant for what I want to do since the language of the data is not english.

I'll add some info about what I want to get accomplished:

I am trying to find duplicates in a database, but they are not total-match duplicates. I can do this after fetching all the records from my database with php and levenshtein() but this method scales very poorly as it demands going through every record and matching it to all the untested records - so i end up with (records.length)! number of minimal steps which isn't efficient.

I would like to do the comparison within the actual SQL query to be more efficient and be able to run this duplicates test on a large scale database (Over 100,000 records).

I feel like code samples are irrelevant to this question so adding none, let me know if I should add something specific.

CodeAt30
  • 874
  • 6
  • 17
  • How do you expect us to suggest any solution, if we do not even know what you are trying to achieve? There are several different ways of doing fuzzy matching from a simple like to complex algorithms. Which of these I should suggest I have no clue because your question does not provide any clues. – Shadow Apr 08 '18 at 10:54

1 Answers1

3

you can create a levenshtein function and use cross join

select 
t1.id,t2.id,
levenshtein(t1.action_date,t2.action_date) as levenshtein
from yourTable t1
CROSS join yourTable t2 on t1.id <> t2.id;

levenshtein function

 DELIMITER //

CREATE DEFINER=`root`@`localhost` FUNCTION `levenshtein`( s1 varchar(255), s2 varchar(255) ) RETURNS int(11)
    DETERMINISTIC
BEGIN 
  DECLARE s1_len, 
    s2_len, 
    i, 
    j, 
    c, 
    c_temp, 
    cost          int; 
  declare s1_char char; 
  -- max strlen=255 
  declare cv0, 
    cv1 varbinary(256); 
  set s1_len = char_length(s1), 
    s2_len = char_length(s2), 
    cv1 = 0x00, 
    j = 1, 
    i = 1, 
    c = 0; 
  if s1 = s2 THEN 
  RETURN 0; 
elseif s1_len = 0 THEN 
  RETURN s2_len; 
elseif s2_len = 0 THEN 
  RETURN s1_len; 
  else 
  WHILE j <= s2_len do 
  SET cv1 = concat(cv1, unhex(hex(j))), 
    j = j + 1; 
end WHILE;
WHILE i <= s1_len do 
SET s1_char = substring(s1, i, 1), 
  c = i, 
  cv0 = unhex(hex(i)), 
  j = 1;WHILE j <= s2_len do 
SET c = c + 1;IF s1_char = Substring(s2, j, 1) then 
SET cost = 0; 
else 
SET cost = 1;END IF;SET c_temp = conv(hex(substring(cv1, j, 1)), 16, 10) + cost;IF c > c_temp then
SET c = c_temp;END IF;SET c_temp = conv(hex(substring(cv1, j+1, 1)), 16, 10) + 1;IF c > c_temp then
SET c = c_temp;END IF;SET cv0 = concat(cv0, unhex(hex(c))), 
  j = j + 1;END WHILE;SET cv1 = cv0, 
  i = i + 1;END WHILE;END IF;RETURN c;
END;//

DELIMITER ;

this function is not from me. see: How to add levenshtein function in mysql?

sample

MariaDB [test]> select * from yourTable;
+----+-------------+-------------+---------------------+-------------+
| id | ministry_id | building_id | action_date         | action_type |
+----+-------------+-------------+---------------------+-------------+
|  1 |          14 |        1653 | 2011-12-23 11:22:33 |           1 |
|  2 |          14 |        1653 | 2012-02-29 11:51:12 |           2 |
|  3 |          14 |        1653 | 2013-06-25 11:29:22 |           1 |
|  4 |          14 |        1653 | 2017-05-15 17:01:37 |           2 |
+----+-------------+-------------+---------------------+-------------+
4 rows in set (0.00 sec)

MariaDB [test]> select 
    -> t1.id,t2.id,
    -> levenshtein(t1.action_date,t2.action_date) as levenshtein
    -> from yourTable t1
    -> CROSS join yourTable t2 on t1.id <> t2.id;
+----+----+-------------+
| id | id | levenshtein |
+----+----+-------------+
|  2 |  1 |           7 |
|  3 |  1 |           7 |
|  4 |  1 |           9 |
|  1 |  2 |           7 |
|  3 |  2 |           6 |
|  4 |  2 |           8 |
|  1 |  3 |           7 |
|  2 |  3 |           6 |
|  4 |  3 |           8 |
|  1 |  4 |           9 |
|  2 |  4 |           8 |
|  3 |  4 |           8 |
+----+----+-------------+
12 rows in set (0.09 sec)

MariaDB [test]> 
Bernd Buffen
  • 14,525
  • 2
  • 24
  • 39
  • But *where* do I create the function? That's what i'm not understanding. Where you I put it in? – CodeAt30 Apr 09 '18 at 18:26
  • @CodeAt30 - Select your Database with **use yourDatabase;** then copy my function from **DELIMITER //** to **DELIMITER ;** and execute them. you must do it only one time. After this you can use the function – Bernd Buffen Apr 09 '18 at 18:34