Identifying similar fields in a database (but not duplicates)

Question

I'm dealing with a querying issue I'm struggling to work around. I have a database of names. What I'm looking to do is figure out those who have multiple names in the database tied to the same ID, where those names are very similar to each other:

ID                          Name
-------------               ----------
123ABC                      Joe Smith

123ABC                      Joseph Smith

345XYZ                      Michael Johnson

345XYZ                      MikeJohnson

678LMN                      Suzyjones

678LMN                      Suzanne Mary Jones

So I'm looking to build a query that can identify these people. Anyone have any suggestions or advice? Obviously, it can be quite tricky because we aren't deal with straight duplicates, but rather small, nuanced changes.

did you see http://stackoverflow.com/a/38513900/5221944 ? this one addresses those nuances (applicable for BigQuery) and easily can be ported to your new example. btw - were you able to implement it in that your previous question? — Mikhail Berlyant, Jul 28 '16 at 18:14
@MikhailBerlyant First of all, thank you for the effort you made in putting that together. That said, I would have no idea how to execute it. While I know BigQuery well, I do not know it at that level. Regex, for example, is like reading an alien langue to me. — wizkids121, Jul 28 '16 at 19:43
we are here for you to help in learning! ;o) not in doing your work! try at least to adopt and share so we will help — Mikhail Berlyant, Jul 28 '16 at 19:50

score 0 · Answer 1 · answered Jul 28 '16 at 18:23

0

Do a self join where the ID matches and the name does not:

select t1.ID, t1.NAME, t2.NAME
from your_table t1
join your_table t2
  on t1.ID = t2.ID
 and t1.NAME <> t2.NAME

answered Jul 28 '16 at 18:23

msheikh25

576
3
9

Right but that isn't what I'm looking for. It's about identifying IDs who have two different names in the database that are similar to each other. I know how to find the IDs, but it's finding those with the example types of names that I'm having trouble flagging down. – wizkids121 Jul 28 '16 at 18:25

score 0 · Answer 2 · answered Jul 28 '16 at 18:29

You can achieve this in multiple ways, I would suggest you to go through the route of group by clause.

The below query assumes that, you will have only record in your table incase there is a name attached to the ID.

;WITH CTE AS
(
SELECT ID 
FROM <yourTable>
group by ID 
HAVING COUNT(1) > 1
)
SELECT T.* 
FROM CTE C
JOIN <yourTable> T
ON C.id - T.ID

If you have multiple rows with the same name in the same table, then you just need to apply the distinct clause before hand.

score 0 · Answer 3 · answered Jul 29 '16 at 00:52

Check below - should work for you
Note WHERE similarity > -1 at the end of query - by setting value instead of -1 you can control similarity threshold. the closer to 1 the more similar pairs you want to capture. Closer to 0 - more pairs to capture !

SELECT ID, Name1, Name2, similarity FROM 
JS( // input table
(
  SELECT one.ID AS ID, one.Name AS Name1, two.Name AS Name2
  FROM YourTable AS one
  JOIN YourTable AS two ON one.ID = two.ID
  HAVING Name1 < Name2
) ,
// input columns
ID, Name1, Name2,
// output schema
"[{name: 'ID', type:'string'},
  {name: 'Name1', type:'string'},
  {name: 'Name2', type:'string'},
  {name: 'similarity', type:'float'}]
",
// function
"function(r, emit) {

  var _extend = function(dst) {
    var sources = Array.prototype.slice.call(arguments, 1);
    for (var i=0; i<sources.length; ++i) {
      var src = sources[i];
      for (var p in src) {
        if (src.hasOwnProperty(p)) dst[p] = src[p];
      }
    }
    return dst;
  };

  var Levenshtein = {
    /**
     * Calculate levenshtein distance of the two strings.
     *
     * @param str1 String the first string.
     * @param str2 String the second string.
     * @return Integer the levenshtein distance (0 and above).
     */
    get: function(str1, str2) {
      // base cases
      if (str1 === str2) return 0;
      if (str1.length === 0) return str2.length;
      if (str2.length === 0) return str1.length;

      // two rows
      var prevRow  = new Array(str2.length + 1),
          curCol, nextCol, i, j, tmp;

      // initialise previous row
      for (i=0; i<prevRow.length; ++i) {
        prevRow[i] = i;
      }

      // calculate current row distance from previous row
      for (i=0; i<str1.length; ++i) {
        nextCol = i + 1;

        for (j=0; j<str2.length; ++j) {
          curCol = nextCol;

          // substution
          nextCol = prevRow[j] + ( (str1.charAt(i) === str2.charAt(j)) ? 0 : 1 );
          // insertion
          tmp = curCol + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }
          // deletion
          tmp = prevRow[j + 1] + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }

          // copy current col value into previous (in preparation for next iteration)
          prevRow[j] = curCol;
        }

        // copy last col value into previous (in preparation for next iteration)
        prevRow[j] = nextCol;
      }

      return nextCol;
    }

  };

  var the_Name1;

  try {
    the_Name1 = decodeURI(r.Name1).toLowerCase();
  } catch (ex) {
    the_Name1 = r.Name1.toLowerCase();
  }

  try {
    the_Name2 = decodeURI(r.Name2).toLowerCase();
  } catch (ex) {
    the_Name2 = r.Name2.toLowerCase();
  }

  emit({ID: r.ID, Name1: the_Name1, Name2: the_Name2,
        similarity: 1 - Levenshtein.get(the_Name1, the_Name2) / the_Name1.length});

  }"
)
WHERE similarity > -1
ORDER BY similarity DESC

you can test it with below example

SELECT ID, Name1, Name2, similarity FROM 
JS( // input table
(
  SELECT one.ID AS ID, one.Name AS Name1, two.Name AS Name2
  FROM (
    SELECT ID, Name FROM
      (SELECT '123ABC' AS ID, 'Joe Smith' AS Name),
      (SELECT '123ABC' AS ID, 'Joseph Smith' AS Name),
      (SELECT '345XYZ' AS ID, 'Michael Johnson' AS Name),
      (SELECT '345XYZ' AS ID, 'MikeJohnson' AS Name),
      (SELECT '678LMN' AS ID, 'Suzyjones' AS Name),
      (SELECT '678LMN' AS ID, 'Suzanne Mary Jones' AS Name),
      (SELECT 'AAA' AS ID, 'Jordan Tigani' AS Name),
      (SELECT 'AAA' AS ID, 'Felipe Hoffa' AS Name),
      (SELECT 'BBB' AS ID, 'Mikhail Berlyant' AS Name),
      (SELECT 'BBB' AS ID, 'Michael Sheldon' AS Name),
  ) AS one
  JOIN (
    SELECT ID, Name FROM
      (SELECT '123ABC' AS ID, 'Joe Smith' AS Name),
      (SELECT '123ABC' AS ID, 'Joseph Smith' AS Name),
      (SELECT '345XYZ' AS ID, 'Michael Johnson' AS Name),
      (SELECT '345XYZ' AS ID, 'MikeJohnson' AS Name),
      (SELECT '678LMN' AS ID, 'Suzyjones' AS Name),
      (SELECT '678LMN' AS ID, 'Suzanne Mary Jones' AS Name),
      (SELECT 'AAA' AS ID, 'Jordan Tigani' AS Name),
      (SELECT 'AAA' AS ID, 'Felipe Hoffa' AS Name),
      (SELECT 'BBB' AS ID, 'Mikhail Berlyant' AS Name),
      (SELECT 'BBB' AS ID, 'Michael Sheldon' AS Name),
  ) AS two
  ON one.ID = two.ID
  HAVING Name1 < Name2
) ,
// input columns
ID, Name1, Name2,
// output schema
"[{name: 'ID', type:'string'},
  {name: 'Name1', type:'string'},
  {name: 'Name2', type:'string'},
  {name: 'similarity', type:'float'}]
",
// function
"function(r, emit) {

  var _extend = function(dst) {
    var sources = Array.prototype.slice.call(arguments, 1);
    for (var i=0; i<sources.length; ++i) {
      var src = sources[i];
      for (var p in src) {
        if (src.hasOwnProperty(p)) dst[p] = src[p];
      }
    }
    return dst;
  };

  var Levenshtein = {
    /**
     * Calculate levenshtein distance of the two strings.
     *
     * @param str1 String the first string.
     * @param str2 String the second string.
     * @return Integer the levenshtein distance (0 and above).
     */
    get: function(str1, str2) {
      // base cases
      if (str1 === str2) return 0;
      if (str1.length === 0) return str2.length;
      if (str2.length === 0) return str1.length;

      // two rows
      var prevRow  = new Array(str2.length + 1),
          curCol, nextCol, i, j, tmp;

      // initialise previous row
      for (i=0; i<prevRow.length; ++i) {
        prevRow[i] = i;
      }

      // calculate current row distance from previous row
      for (i=0; i<str1.length; ++i) {
        nextCol = i + 1;

        for (j=0; j<str2.length; ++j) {
          curCol = nextCol;

          // substution
          nextCol = prevRow[j] + ( (str1.charAt(i) === str2.charAt(j)) ? 0 : 1 );
          // insertion
          tmp = curCol + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }
          // deletion
          tmp = prevRow[j + 1] + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }

          // copy current col value into previous (in preparation for next iteration)
          prevRow[j] = curCol;
        }

        // copy last col value into previous (in preparation for next iteration)
        prevRow[j] = nextCol;
      }

      return nextCol;
    }

  };

  var the_Name1;

  try {
    the_Name1 = decodeURI(r.Name1).toLowerCase();
  } catch (ex) {
    the_Name1 = r.Name1.toLowerCase();
  }

  try {
    the_Name2 = decodeURI(r.Name2).toLowerCase();
  } catch (ex) {
    the_Name2 = r.Name2.toLowerCase();
  }

  emit({ID: r.ID, Name1: the_Name1, Name2: the_Name2,
        similarity: 1 - Levenshtein.get(the_Name1, the_Name2) / the_Name1.length});

  }"
)
WHERE similarity > -1
ORDER BY similarity DESC

it produces below result

ID          Name1               Name2               similarity   
123ABC      joe smith           joseph smith        0.6666666666666667   
345XYZ      michael johnson     mikejohnson         0.6666666666666667   
678LMN      suzanne mary jones  suzyjones           0.5  
BBB         michael sheldon     mikhail berlyant    0.4666666666666667   
AAA         felipe hoffa        jordan tigani       0.0

Identifying similar fields in a database (but not duplicates)

3 Answers3