1

I'm trying to measure similarity between strings using Dice's Coefficient (aka Pair Similarity) in BigQuery. For a second I thought that I can do that using just standard functions.

Suppose I need to compare "gana" and "gano". Then I would "cook" these two strings upfront into 'ga|an|na' and 'ga|an|no' (lists of 2-grams) and do this:

REGEXP_REPLACE('ga|an|na', 'ga|an|no', '')

Then based on change in length I can calculate my coeff.

But once applied to the table I get:

REGEXP_REPLACE second argument must be const and non-null

Is there any workaround for that? With simple REPLACE() second argument can be a field.

Maybe there is a better way to do it? I know, I can do UDF instead. But I wanted to avoid them here. We are running big tasks and UDFs are generally slower (at least in my experience) and are subject to different concurrency limit.

3 Answers3

1

You can have JavaScript code inside for BigQuery SQL queries.

To measure similarity you could use Levenshtein's distance with a query like this (from https://stackoverflow.com/a/33443564/132438):

SELECT *
FROM js(
(
  SELECT title,target FROM
   (SELECT 'hola' title, 'hello' target), (SELECT 'this is beautiful' title, 'that is fantastic' target) 
),
  title, target,
  // Output schema.
  "[{name: 'title', type:'string'},
    {name: 'target', type:'string'},
    {name: 'distance', type:'integer'}]",
  // The function
  "function(r, emit) {

  var _extend = function(dst) {
    var sources = Array.prototype.slice.call(arguments, 1);
    for (var i=0; i<sources.length; ++i) {
      var src = sources[i];
      for (var p in src) {
        if (src.hasOwnProperty(p)) dst[p] = src[p];
      }
    }
    return dst;
  };

  var Levenshtein = {
    /**
     * Calculate levenshtein distance of the two strings.
     *
     * @param str1 String the first string.
     * @param str2 String the second string.
     * @return Integer the levenshtein distance (0 and above).
     */
    get: function(str1, str2) {
      // base cases
      if (str1 === str2) return 0;
      if (str1.length === 0) return str2.length;
      if (str2.length === 0) return str1.length;

      // two rows
      var prevRow  = new Array(str2.length + 1),
          curCol, nextCol, i, j, tmp;

      // initialise previous row
      for (i=0; i<prevRow.length; ++i) {
        prevRow[i] = i;
      }

      // calculate current row distance from previous row
      for (i=0; i<str1.length; ++i) {
        nextCol = i + 1;

        for (j=0; j<str2.length; ++j) {
          curCol = nextCol;

          // substution
          nextCol = prevRow[j] + ( (str1.charAt(i) === str2.charAt(j)) ? 0 : 1 );
          // insertion
          tmp = curCol + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }
          // deletion
          tmp = prevRow[j + 1] + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }

          // copy current col value into previous (in preparation for next iteration)
          prevRow[j] = curCol;
        }

        // copy last col value into previous (in preparation for next iteration)
        prevRow[j] = nextCol;
      }

      return nextCol;
    }

  };

  var the_title;

  try {
    the_title = decodeURI(r.title).toLowerCase();
  } catch (ex) {
    the_title = r.title.toLowerCase();
  }

  emit({title: the_title, target: r.target,
        distance: Levenshtein.get(the_title, r.target)});

  }")
Community
  • 1
  • 1
Felipe Hoffa
  • 54,922
  • 16
  • 151
  • 325
  • Hi Felipe, so JavaScript is the only option? I wanted to avoid it because of speed limits. But If not possible - will use JS. – Nikolay Novozhlov Mar 22 '16 at 03:51
  • OMG inline javascript?! I was hoping something like this was possible but I couldn't find out how to do it. Thanks for posting this! – Emery Lapinski Mar 24 '16 at 12:44
  • Can you tell me where to find the documentation on this JS() function? I am having trouble finding it even though I now know it exists. Thanks! – Emery Lapinski Mar 24 '16 at 13:33
1

Below is tailored for similarity
Was used in How to perform trigram operations in Google BigQuery? and based on https://storage.googleapis.com/thomaspark-sandbox/udf-examples/pataky.js by @thomaspark

SELECT text1, text2, similarity FROM 
JS(
// input table
(
  SELECT * FROM 
  (SELECT 'mikhail' AS text1, 'mikhail' AS text2),
  (SELECT 'mikhail' AS text1, 'mike' AS text2),
  (SELECT 'mikhail' AS text1, 'michael' AS text2),
  (SELECT 'mikhail' AS text1, 'javier' AS text2),
  (SELECT 'mikhail' AS text1, 'thomas' AS text2)
) ,
// input columns
text1, text2,
// output schema
"[{name: 'text1', type:'string'},
  {name: 'text2', type:'string'},
  {name: 'similarity', type:'float'}]
",
// function
"function(r, emit) {

  var _extend = function(dst) {
    var sources = Array.prototype.slice.call(arguments, 1);
    for (var i=0; i<sources.length; ++i) {
      var src = sources[i];
      for (var p in src) {
        if (src.hasOwnProperty(p)) dst[p] = src[p];
      }
    }
    return dst;
  };

  var Levenshtein = {
    /**
     * Calculate levenshtein distance of the two strings.
     *
     * @param str1 String the first string.
     * @param str2 String the second string.
     * @return Integer the levenshtein distance (0 and above).
     */
    get: function(str1, str2) {
      // base cases
      if (str1 === str2) return 0;
      if (str1.length === 0) return str2.length;
      if (str2.length === 0) return str1.length;

      // two rows
      var prevRow  = new Array(str2.length + 1),
          curCol, nextCol, i, j, tmp;

      // initialise previous row
      for (i=0; i<prevRow.length; ++i) {
        prevRow[i] = i;
      }

      // calculate current row distance from previous row
      for (i=0; i<str1.length; ++i) {
        nextCol = i + 1;

        for (j=0; j<str2.length; ++j) {
          curCol = nextCol;

          // substution
          nextCol = prevRow[j] + ( (str1.charAt(i) === str2.charAt(j)) ? 0 : 1 );
          // insertion
          tmp = curCol + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }
          // deletion
          tmp = prevRow[j + 1] + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }

          // copy current col value into previous (in preparation for next iteration)
          prevRow[j] = curCol;
        }

        // copy last col value into previous (in preparation for next iteration)
        prevRow[j] = nextCol;
      }

      return nextCol;
    }

  };

  var the_text1;

  try {
    the_text1 = decodeURI(r.text1).toLowerCase();
  } catch (ex) {
    the_text1 = r.text1.toLowerCase();
  }

  try {
    the_text2 = decodeURI(r.text2).toLowerCase();
  } catch (ex) {
    the_text2 = r.text2.toLowerCase();
  }

  emit({text1: the_text1, text2: the_text2,
        similarity: 1 - Levenshtein.get(the_text1, the_text2) / the_text1.length});

  }"
)
ORDER BY similarity DESC
Community
  • 1
  • 1
Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
  • Can you tell me where to find the documentation on this JS() function? I am having trouble finding it even though I now know it exists. Thanks! – Emery Lapinski Mar 24 '16 at 13:29
  • the link is in answer – Mikhail Berlyant Mar 24 '16 at 14:30
  • I guess I'm just not seeing it. The link you mention is to a UDF. I am interested in the "FROM JS(...)" part where you declare the Javascript function inline along with the rest of the SQL SELECT statement. Where can I find the documentation on this? – Emery Lapinski Mar 24 '16 at 14:46
  • @fxm27 you can find documentation on BigQuery User-Defined Functions here --> cloud.google.com/bigquery/user-defined-functions – Mikhail Berlyant Mar 24 '16 at 16:48
  • Again, it looks like those examples are using the UDF Editor in the BigQuery UI ("bigquery.defineFunction(..."), or loading a UDF file using (for example through the the --udf_resource option of the bq command line.) I want to know more about how you are defining the UDF inline in the SQL query, like you've done in the code you posted above (the part after "FROM JS(...)". What is JS()? Is that a BigQuery SQL function? Or a BigQuery SQL language construct? Where is it described in the documentation? Again, I'm just not seeing it. Which section are you looking at? – Emery Lapinski Mar 24 '16 at 17:37
  • sorry, it is not clear what you are looking for that is not in those links. looks like you have decent question in mind - I recommend you to post one with more details and you most likely will get proper answer – Mikhail Berlyant Mar 24 '16 at 17:40
  • Thanks. Here it is http://stackoverflow.com/questions/36207063/where-is-the-bigquery-documentation-describing-how-to-define-a-javascript-udf-fu – Emery Lapinski Mar 24 '16 at 18:10
0

REGEXP_REPLACE second argument must be const and non-null
Is there any workaround for that?

Below is just an idea/direction to address above question applied to logic you described:

I would "cook" these two strings upfront into 'ga|an|na' and 'ga|an|no' (lists of 2-grams) and do this: REGEXP_REPLACE('ga|an|na', 'ga|an|no', ''). Then based on change in length I can calculate my coeff.

The "workaround" is:

SELECT a.w AS w1, b.w AS w2, SUM(a.x = b.x) / COUNT(1) AS c
FROM (
  SELECT w, SPLIT(p, '|') AS x, ROW_NUMBER() OVER(PARTITION BY w) AS pos
  FROM 
    (SELECT 'gana' AS w, 'ga|an|na' AS p)
) AS a
JOIN (
  SELECT w, SPLIT(p, '|') AS x, ROW_NUMBER() OVER(PARTITION BY w) AS pos
  FROM 
    (SELECT 'gano' AS w, 'ga|an|no' AS p),
    (SELECT 'gamo' AS w, 'ga|am|mo' AS p),
    (SELECT 'kana' AS w, 'ka|an|na' AS p)
) AS b
ON a.pos = b.pos
GROUP BY w1, w2  

Maybe there is a better way to do it?

Below is the simple example of how Pair Similarity can be approached here (including building bigrams sets and calculation of coefficient:

SELECT
  a.word AS word1, b.word AS word2, 
  2 * SUM(a.bigram = b.bigram) / 
    (EXACT_COUNT_DISTINCT(a.bigram) + EXACT_COUNT_DISTINCT(b.bigram) ) AS c
FROM (
  SELECT word, char + next_char AS bigram
  FROM (
    SELECT word, char, LEAD(char, 1) OVER(PARTITION BY word ORDER BY pos) AS next_char
    FROM (
      SELECT word, SPLIT(word, '') AS char, ROW_NUMBER() OVER(PARTITION BY word) AS pos
      FROM
        (SELECT 'gana' AS word)
    )
  )
  WHERE next_char IS NOT NULL
  GROUP BY 1, 2
) a
CROSS JOIN (
  SELECT word, char + next_char AS bigram
  FROM (
    SELECT word, char, LEAD(char, 1) OVER(PARTITION BY word ORDER BY pos) AS next_char
    FROM (
      SELECT word, SPLIT(word, '') AS char, ROW_NUMBER() OVER(PARTITION BY word) AS pos
      FROM
        (SELECT 'gano' AS word)
    )
  )
  WHERE next_char IS NOT NULL
  GROUP BY 1, 2
) b
GROUP BY 1, 2
Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230