13

I'm wondering if anyone knows of a way to measure string similarity in BigQuery.

Seems like would be a neat function to have.

My case is i need to compare the similarity of two urls as want to be fairly sure they refer to the same article.

I can find examples using javascript so maybe a UDF is the way to go but i've not used UDF's at all (or javascript for that matter :) )

Just wondering if there may be a way using existing regex functions or if anyone might be able to get me started with porting the javascript example into a UDF.

Any help much appreciated, thanks

EDIT: Adding some example code

So if i have a UDF defined as:

// distance function

function levenshteinDistance (row, emit) {

  //if (row.inputA.length <= 0 ) {var myresult = row.inputB.length};
  if (typeof row.inputA === 'undefined') {var myresult = 1};
  if (typeof row.inputB === 'undefined') {var myresult = 1};
  //if (row.inputB.length <= 0 ) {var myresult = row.inputA.length};

    var myresult = Math.min(
        levenshteinDistance(row.inputA.substr(1), row.inputB) + 1,
        levenshteinDistance(row.inputB.substr(1), row.inputA) + 1,
        levenshteinDistance(row.inputA.substr(1), row.inputB.substr(1)) + (row.inputA[0] !== row.inputB[0] ? 1 : 0)
    ) + 1;

  emit({outputA: myresult})

}

bigquery.defineFunction(
  'levenshteinDistance',                           // Name of the function exported to SQL
  ['inputA', 'inputB'],                    // Names of input columns
  [{'name': 'outputA', 'type': 'integer'}],  // Output schema
  levenshteinDistance                       // Reference to JavaScript UDF
);

// make a test function to test individual parts

function test(row, emit) {
  if (row.inputA.length <= 0) { var x = row.inputB.length} else { var x = row.inputA.length};
  emit({outputA: x});
}

bigquery.defineFunction(
  'test',                           // Name of the function exported to SQL
  ['inputA', 'inputB'],                    // Names of input columns
  [{'name': 'outputA', 'type': 'integer'}],  // Output schema
  test                       // Reference to JavaScript UDF
);

Any i try test with a query such as:

SELECT outputA FROM (levenshteinDistance(SELECT "abc" AS inputA, "abd" AS inputB))

I get error:

Error: TypeError: Cannot read property 'substr' of undefined at line 11, columns 38-39 Error Location: User-defined function

It seems like maybe row.inputA is not a string perhaps or for some reason string functions not able to work on it. Not sure if this is a type issue or something funny about what utils the UDF is able to use by default.

Again any help much appreciated, thanks.

andrewm4894
  • 1,451
  • 4
  • 17
  • 37

8 Answers8

13

Ready to use shared UDFs - Levenshtein distance:

SELECT fhoffa.x.levenshtein('felipe', 'hoffa')
 , fhoffa.x.levenshtein('googgle', 'goggles')
 , fhoffa.x.levenshtein('is this the', 'Is This The')

6  2  0

Soundex:

SELECT fhoffa.x.soundex('felipe')
 , fhoffa.x.soundex('googgle')
 , fhoffa.x.soundex('guugle')

F410  G240  G240

Fuzzy choose one:

SELECT fhoffa.x.fuzzy_extract_one('jony' 
  , (SELECT ARRAY_AGG(name) 
   FROM `fh-bigquery.popular_names.gender_probabilities`) 
  #, ['john', 'johnny', 'jonathan', 'jonas']
)

johnny

How-to:

Felipe Hoffa
  • 54,922
  • 16
  • 151
  • 325
  • I tried `fuzzy_extract_one` but got `Invalid choices at gs://fh-bigquery/js/fuzzball.umd.min.js line 1, columns 5895-5896` and can't figured out why – luisvenezian Mar 15 '22 at 22:29
  • Oh @luisvenezian I work for Snowflake now, so I won't be able to fix my old queries. But in case you are interested, Snowflake has native Levenhstein https://docs.snowflake.com/en/sql-reference/functions/editdistance.html – Felipe Hoffa Mar 15 '22 at 22:34
8

If you're familiar with Python, you can use the functions defined by fuzzywuzzy in BigQuery using external libraries loaded from GCS.

Steps:

  1. Download the javascript version of fuzzywuzzy (fuzzball)
  2. Take the compiled file of the library: dist/fuzzball.umd.min.js and rename it to a clearer name (like fuzzball)
  3. Upload it to a google cloud storage bucket
  4. Create a temp function to use the lib in your query (set the path in OPTIONS to the relevant path)
CREATE TEMP FUNCTION token_set_ratio(a STRING, b STRING)
RETURNS FLOAT64
LANGUAGE js AS """
  return fuzzball.token_set_ratio(a, b);
"""
OPTIONS (
  library="gs://my-bucket/fuzzball.js");

with data as (select "my_test_string" as a, "my_other_string" as b)

SELECT  a, b, token_set_ratio(a, b) from data
Colin Le Nost
  • 460
  • 4
  • 10
  • This is a great idea, I've not been able to get it running however. Returns `SyntaxError: Invalid regular expression: : Range out of order in character class at gs://test-udfs/fuzzball.js line 4, columns 9575-9576` – tw0000 Oct 22 '19 at 00:19
  • I just retried after having re downloaded the file and it's still working for me. Are you using standard SQL ? On which strings ? – Colin Le Nost Jan 08 '20 at 14:29
4

Levenshtein via JS would be the way to go. You can use the algorithm to get absolute string distance, or convert it to a percentage similarity by simply calculating abs(strlen - distance / strlen).

The easiest way to implement this would be to define a Levenshtein UDF that takes two inputs, a and b, and calculates the distance between them. The function could return a, b, and the distance.

To invoke it, you'd then pass in the two URLs as columns aliased to 'a' and 'b':

SELECT a, b, distance
FROM
  Levenshtein(
     SELECT
       some_url AS a, other_url AS b
     FROM
       your_table
  )
thomaspark
  • 488
  • 3
  • 14
  • Thanks - thats how i'm looking to use it but the JS solution i linked to does not seem to work out of the box if i try implement it in BQ UDF and just not sure what i'm doing wrong. – andrewm4894 Nov 02 '15 at 13:21
  • 1
    Hi, here is a Levenshtein implementation that you can use : https://storage.googleapis.com/thomaspark-sandbox/udf-examples/pataky.js (Reference as "gs://thomaspark-sandbox/udf-examples/pataky.js"). I wrote it for a Wikipedia example, so it's bound to input columns "target" and "title". It returns the absolute edit numeric distance. – thomaspark Nov 04 '15 at 02:45
  • Thanks - defo would not have figured that out. Will use it to learn and play around with also. cheers, much appreciated. – andrewm4894 Nov 04 '15 at 16:03
  • Glad to help! We'll be releasing a catalogue of JS functions in the future that users can pull into their queries; we can definitely include Levenshtein. – thomaspark Nov 05 '15 at 18:09
  • That's a great idea, would be a great starting point to learn also if we could see the code somewhere, tinker and then contribute back somewhere. Cheers – andrewm4894 Nov 06 '15 at 08:48
4

Below is quite simpler version for Hamming Distance by using WITH OFFSET instead of ROW_NUMBER() OVER()

#standardSQL
WITH Input AS (
  SELECT 'abcdef' AS strings UNION ALL
  SELECT 'defdef' UNION ALL
  SELECT '1bcdef' UNION ALL
  SELECT '1bcde4' UNION ALL
  SELECT '123de4' UNION ALL
  SELECT 'abc123'
)
SELECT 'abcdef' AS target, strings, 
  (SELECT COUNT(1) 
    FROM UNNEST(SPLIT('abcdef', '')) a WITH OFFSET x
    JOIN UNNEST(SPLIT(strings, '')) b WITH OFFSET y
    ON x = y AND a != b) hamming_distance
FROM Input
Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
3

I couldn't find a direct answer to this, so I propose this solution, in standard SQL

#standardSQL
CREATE TEMP FUNCTION HammingDistance(a STRING, b STRING) AS (
  (
  SELECT
    SUM(counter) AS diff
  FROM (
    SELECT
      CASE
        WHEN X.value != Y.value THEN 1
        ELSE 0
      END AS counter
    FROM (
      SELECT
        value,
        ROW_NUMBER() OVER() AS row
      FROM
        UNNEST(SPLIT(a, "")) AS value ) X
    JOIN (
      SELECT
        value,
        ROW_NUMBER() OVER() AS row
      FROM
        UNNEST(SPLIT(b, "")) AS value ) Y
    ON
      X.row = Y.row )
   )
);

WITH Input AS (
  SELECT 'abcdef' AS strings UNION ALL
  SELECT 'defdef' UNION ALL
  SELECT '1bcdef' UNION ALL
  SELECT '1bcde4' UNION ALL
  SELECT '123de4' UNION ALL
  SELECT 'abc123'
)

SELECT strings, 'abcdef' as target, HammingDistance('abcdef', strings) as hamming_distance
FROM Input;

Compared to other solutions (like this one), it takes two strings (of the same length, following the definition for hamming distance) and outputs the expected distance.

jquinter
  • 144
  • 4
3

I did it like this:

CREATE TEMP FUNCTION trigram_similarity(a STRING, b STRING) AS (
  (
    WITH a_trigrams AS (
      SELECT
        DISTINCT tri_a
      FROM
        unnest(ML.NGRAMS(SPLIT(LOWER(a), ''), [3,3])) AS tri_a
    ),
    b_trigrams AS (
      SELECT
        DISTINCT tri_b
      FROM
        unnest(ML.NGRAMS(SPLIT(LOWER(b), ''), [3,3])) AS tri_b
    )
    SELECT
      COUNTIF(tri_b IS NOT NULL) / COUNT(*)
    FROM
      a_trigrams
      LEFT JOIN b_trigrams ON tri_a = tri_b
  )
);

Here is a comparison to Postgres's pg_trgm:

select trigram_similarity('saemus', 'seamus');
-- 0.25 vs. pg_trgm 0.272727

select trigram_similarity('shamus', 'seamus');
-- 0.5 vs. pg_trgm 0.4

I gave the same answer on How to perform trigram operations in Google BigQuery?

Seamus Abshere
  • 8,326
  • 4
  • 44
  • 61
2

While I was looking for the answer Felipe above, I worked on my own query and ended up with two versions, one which I called string approximation and another string resemblance.

The first is looking at the shortest distance between letters of source string and test string and returns a score between 0 and 1 where 1 is a complete match. It will always score based on the longest string of the two. It turns out to return similar results to the Levensthein distance.

#standardSql
CREATE OR REPLACE FUNCTION `myproject.func.stringApproximation`(sourceString STRING, testString STRING) AS (
(select avg(best_result) from (
                              select if(length(testString)<length(sourceString), sourceoffset, testoffset) as ref, 
                              case 
                                when min(result) is null then 0
                                else 1 / (min(result) + 1) 
                              end as best_result,
                              from (
                                       select *,
                                              if(source = test, abs(sourceoffset - (testoffset)),
                                              greatest(length(testString),length(sourceString))) as result
                                       from unnest(split(lower(sourceString),'')) as source with offset sourceoffset
                                                cross join
                                            (select *
                                             from unnest(split(lower(testString),'')) as test with offset as testoffset)
                                       ) as results
                              group  by ref
                                 )
        )
);

The second is a variation of the first, where it will look at sequences of matching distances, so that a character matching at equal distance from the character preceding or following it will count as one point. This works quite well, better than string approximation but not quite as well as I would like to (see example output below).

    #standarSql
    CREATE OR REPLACE FUNCTION `myproject.func.stringResemblance`(sourceString STRING, testString STRING) AS (
(
select avg(sequence)
from (
      select ref,
             if(array_length(array(select * from comparison.collection intersect distinct
                                   (select * from comparison.before))) > 0
                    or array_length(array(select * from comparison.collection intersect distinct
                                          (select * from comparison.after))) > 0
                 , 1, 0) as sequence

      from (
               select ref,
                      collection,
                      lag(collection) over (order by ref)  as before,
                      lead(collection) over (order by ref) as after
               from (
                     select if(length(testString) < length(sourceString), sourceoffset, testoffset) as ref,
                            array_agg(result ignore nulls)                                          as collection
                     from (
                              select *,
                                     if(source = test, abs(sourceoffset - (testoffset)), null) as result
                              from unnest(split(lower(sourceString),'')) as source with offset sourceoffset
                                       cross join
                                   (select *
                                    from unnest(split(lower(testString),'')) as test with offset as testoffset)
                              ) as results
                     group by ref
                        )
               ) as comparison
      )

)
);

Now here is a sample of result:

#standardSQL
with test_subjects as (
  select 'benji' as name union all
  select 'benjamin' union all
  select 'benjamin alan artis' union all
  select 'ben artis' union all
  select 'artis benjamin' 
)

select name, quick.stringApproximation('benjamin artis', name) as approxiamtion, quick.stringResemblance('benjamin artis', name) as resemblance
from test_subjects

order by resemblance desc

This returns

+---------------------+--------------------+--------------------+
| name                | approximation      | resemblance        |
+---------------------+--------------------+--------------------+
| artis benjamin      | 0.2653061224489796 | 0.8947368421052629 |
+---------------------+--------------------+--------------------+
| benjamin alan artis | 0.6078947368421053 | 0.8947368421052629 |
+---------------------+--------------------+--------------------+
| ben artis           | 0.4142857142857142 | 0.7142857142857143 |
+---------------------+--------------------+--------------------+
| benjamin            | 0.6125850340136053 | 0.5714285714285714 |
+---------------------+--------------------+--------------------+
| benji               | 0.36269841269841263| 0.28571428571428575|
+----------------------------------------------------------------

Edited: updated the resemblance algorithm to improve results.

neydroydrec
  • 6,973
  • 9
  • 57
  • 89
0

Try Flookup for Google Sheets... it's definitely faster than Levenshtein distance and it calculates percentage similarities right out of the box. One Flookup function you might find useful is this:

FUZZYMATCH (string1, string2)

Parameter Details

  1. string1: compares to string2.
  2. string2: compares to string1.

The percentage similarity is then calculated based on these comparisons. Both parameters can be ranges.

I'm currently trying to optimise it for large data sets so you feedback would be very welcome.

Edit: I'm the creator of Flookup.