Matching partial words in two different columns

Question

I am working on trying to weed out a certain customer from our database. I've noticed a trend where people fill out their first name with the same name that is partial to how they fill out their company name. So an example would look like:

business_name               first_name
-------------               ----------
locksmith taylorsville      locksmith

locksmith roy               locksmi

locksmith clinton           locks

locksmith farmington        locksmith

These are people I do not want being pulled in a query. They are bad eggs. I'm trying to put together a query with a WHERE statement (presumably) that isolates anyone who has a first name that contains at least a partial match to their business name, but I'm stumped and could use some help.

score 1 · Answer 1 · answered Jul 21 '16 at 18:44

1

You can use LIKE operator:

SELECT * FROM table WHERE business_name NOT LIKE CONCAT(first_name, '%')

% stands for anything.

answered Jul 21 '16 at 18:44

sm21

33
5

This was very helpful. What if I wanted to isolate anyone who's first name contains a space in it? So someone who fills out their first names as say, "Lock Smith"? – wizkids121 Jul 21 '16 at 20:07
You can use `CONTAINS` – sm21 Jul 21 '16 at 20:10

score 0 · Accepted Answer · edited May 23 '17 at 12:19

You can employ similarity based approach
Try code at bottom of answer
It produces result like below

business_name           partial_business_name   first_name  similarity   
locksmith taylorsville  locksmith               locksmith   1.0  
locksmith farmington    locksmith               locksmith   1.0  
locksmith roy           locksmith               locksmi     0.7777777777777778   
locksmith clinton       locksmith               locks       0.5555555555555556

So, you will be able to control what to filter out based on similarity value

** Code **

SELECT business_name, partial_business_name, first_name, similarity FROM 
JS( // input table
(
  SELECT business_name, REGEXP_EXTRACT(business_name, r'^(\w+)') AS partial_business_name, first_name AS first_name FROM 
    (SELECT 'locksmith taylorsville' AS business_name, 'locksmith' AS first_name),
    (SELECT 'locksmith roy' AS business_name, 'locksmi' AS first_name),
    (SELECT 'locksmith clinton' AS business_name, 'locks' AS first_name),
    (SELECT 'locksmith farmington' AS business_name, 'locksmith' AS first_name),
) ,
// input columns
business_name, partial_business_name, first_name,
// output schema
"[{name: 'business_name', type:'string'},
  {name: 'partial_business_name', type:'string'},
  {name: 'first_name', type:'string'},
  {name: 'similarity', type:'float'}]
",
// function
"function(r, emit) {

  var _extend = function(dst) {
    var sources = Array.prototype.slice.call(arguments, 1);
    for (var i=0; i<sources.length; ++i) {
      var src = sources[i];
      for (var p in src) {
        if (src.hasOwnProperty(p)) dst[p] = src[p];
      }
    }
    return dst;
  };

  var Levenshtein = {
    /**
     * Calculate levenshtein distance of the two strings.
     *
     * @param str1 String the first string.
     * @param str2 String the second string.
     * @return Integer the levenshtein distance (0 and above).
     */
    get: function(str1, str2) {
      // base cases
      if (str1 === str2) return 0;
      if (str1.length === 0) return str2.length;
      if (str2.length === 0) return str1.length;

      // two rows
      var prevRow  = new Array(str2.length + 1),
          curCol, nextCol, i, j, tmp;

      // initialise previous row
      for (i=0; i<prevRow.length; ++i) {
        prevRow[i] = i;
      }

      // calculate current row distance from previous row
      for (i=0; i<str1.length; ++i) {
        nextCol = i + 1;

        for (j=0; j<str2.length; ++j) {
          curCol = nextCol;

          // substution
          nextCol = prevRow[j] + ( (str1.charAt(i) === str2.charAt(j)) ? 0 : 1 );
          // insertion
          tmp = curCol + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }
          // deletion
          tmp = prevRow[j + 1] + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }

          // copy current col value into previous (in preparation for next iteration)
          prevRow[j] = curCol;
        }

        // copy last col value into previous (in preparation for next iteration)
        prevRow[j] = nextCol;
      }

      return nextCol;
    }

  };

  var the_partial_business_name;

  try {
    the_partial_business_name = decodeURI(r.partial_business_name).toLowerCase();
  } catch (ex) {
    the_partial_business_name = r.partial_business_name.toLowerCase();
  }

  try {
    the_first_name = decodeURI(r.first_name).toLowerCase();
  } catch (ex) {
    the_first_name = r.first_name.toLowerCase();
  }

  emit({business_name: r.business_name, partial_business_name: the_partial_business_name, first_name: the_first_name,
        similarity: 1 - Levenshtein.get(the_partial_business_name, the_first_name) / the_partial_business_name.length});

  }"
)
ORDER BY similarity DESC

Was used in How to perform trigram operations in Google BigQuery? and based on https://storage.googleapis.com/thomaspark-sandbox/udf-examples/pataky.js by @thomaspark where Levenshtein's distance is used to measure similarity

score 0 · Answer 3 · answered Jul 22 '16 at 20:39

0

this will do the trick,

select * from TableName where lower(business_name) contains lower(first_name)

use lower() just in case they have upper case letters. Hope it helps.

answered Jul 22 '16 at 20:39

Haipeng Su

2,341
2
15
30

Matching partial words in two different columns

3 Answers3

Linked