2

Example

with
    lhs_table as (
        select 'Nia' as firstName, 'Johnson Jr.' as lastName, 'FEMALE' as gender, 'UNLV' as school, 'Mountain West Conference' as conference union all
        select 'Jana' as firstName, 'Abdullah' as lastName, 'FEMALE' as gender, 'Cincinnati' as school, 'American Athletic Conference' as conference union all
        select 'Kay' as firstName, 'Sieper' as lastName, 'FEMALE' as gender, 'Loyola Maryland' as school, 'Patriot League' as conference union all
        select 'Alessia' as firstName, 'Capley' as lastName, 'FEMALE' as gender, 'Presbyterian' as school, 'Big South Conference' as conference union all
        select 'Aaliyah' as firstName, 'Stanley' as lastName, 'FEMALE' as gender, 'FGCU' as school, 'ASUN Conference' as conference 
    ),

    rhs_table as (
        select 1611707 as playerId, 'Kayla' as firstName, 'Sieper' as lastName, 'Loyola Maryland' as teamMarket, 'Patriot League' as conferenceName union all
        select 1380430 as playerId, 'Jana' as firstName, 'Abdullah' as lastName, 'Cincinnati' as teamMarket, 'American Athletic Conference' as conferenceName union all
        select 1234567 as playerId, 'Mela' as firstName, 'Aravada' as lastName, 'Cincinnati' as teamMarket, 'American Athletic Conference' as conferenceName union all
        select 1354105 as playerId, 'Aaliyah' as firstName, 'Stanley' as lastName, 'FGCU' as teamMarket, 'ASUN Conference' as conferenceName union all
        select 1138439 as playerId, 'Aaliyah' as firstName, 'Stanley' as lastName, 'Emory' as teamMarket, 'ASUN Conference' as conferenceName union all
        select 996101 as playerId, 'Nia' as firstName, 'Johnson' as lastName, 'Emmanuel (GA)' as teamMarket, 'Conference Carolinas' as conferenceName union all
        select 977605 as playerId, 'Nia' as firstName, 'Johnson' as lastName, 'UNLV' as teamMarket, 'Mountain West Conference' as conferenceName union all
        select 1329967 as playerId, 'Alessia' as firstName, 'Capley' as lastName, 'Presbyterian' as teamMarket, 'Big South Conference' as conferenceName union all
        select 995234 as playerId, 'Nia' as firstName, 'Johnson' as lastName, 'Delta St.' as teamMarket, 'Gulf South Conference' as conferenceName union all
        select 4567890 as playerId, 'Britney' as firstName, 'Capley' as lastName, 'Presbyterian' as teamMarket, 'Big South Conference' as conferenceName
    )

select 
    b.playerId
    ,a.*
from lhs_table as a 
left join rhs_table as b
    on a.firstName = b.firstName
    and a.lastName = b.lastName
    and a.school = b.teamMarket
    and a.conference = b.conferenceName

enter image description here

We are looking to join rhs_table onto lhs_table for the playerIds. Every person in lhs_table has a corresponding row in rhs_table, however the joins are not so simple:

  • For Nia Johnson Jr., Jr. is missing in rhs_table
  • For Kay Sieper, her full name Kayla is used in rhs_table
  • We want to ignore (ie not left join) RHS players on the wrong team (Nia on Emmanuel, Aaliyah on Emory).

Because of these mismatches, we need to fuzzy match instead. We have tried replacing on a.firstName = b.firstName with on a.firstName like b.firstName. Note that the conferences are the 1 column that do match exactly between tables, and also if it helps we can manually ensure that the teams match, although it would take some time. The important part is handling names not spelt the same.

The 5 correct playerIds, in order, are 977605, 1380430, 1611707, 1329967, 1354105. Can we somehow fuzzy match to get these playerIds?

Canovice
  • 9,012
  • 22
  • 93
  • 211

1 Answers1

2

Consider below approach

select 
  array_agg(b.playerId order by d limit 1) playerId,
  any_value(a).*
from (
  select 
    `bqutil.fn.levenshtein`(a.firstName, b.firstName) / greatest(length(a.firstName), length(b.firstName)) + 
    `bqutil.fn.levenshtein`(a.lastName, b.lastName) / greatest(length(a.lastName), length(b.lastName)) + 
    `bqutil.fn.levenshtein`(a.school, b.teamMarket) / greatest(length(a.school), length(b.teamMarket)) + 
    `bqutil.fn.levenshtein`(a.conference, b.conferenceName) / greatest(length(a.conference), length(b.conferenceName)) d,
    a, b
  from lhs_table as a 
  cross join rhs_table as b
)
group by to_json_string(a)               

if applied to sample data in your question - output is

enter image description here

You can play/experiment with variation of above, like below as an example

select 
  array_agg(b.playerId order by d limit 1) playerId,
  any_value(a).*
from (
  select 
    `bqutil.fn.levenshtein`(format('%t', a), format('%t', b)) d,
    a, b
  from lhs_table as a 
  cross join rhs_table as b
)
group by to_json_string(a)
Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
  • if our actual `lhs_table` and `rhs_table` had ~100K rows each rather than 5-10, would that be computationally problematic with the `cross join`? – Canovice Feb 01 '22 at 02:48
  • 1
    If you have some (at least one) column that you know will have exact match - you can use join on such column(s) – Mikhail Berlyant Feb 01 '22 at 02:58
  • `cross join with_names_added as b on a.school = b.teamMarket and a.conference = b.conferenceName` - something like this doesn't work. Assuming these columns were exact matches, how could I join on those columns while still using cross join? – Canovice Feb 01 '22 at 03:23
  • I've edited the question such that the schools and conferences match – Canovice Feb 01 '22 at 03:26
  • 1
    Replace cross join with just join – Mikhail Berlyant Feb 01 '22 at 03:28
  • Function not found: `bqutil.fn.levenshtein` - is there a process to installing bqutil? – Canovice Feb 01 '22 at 04:02
  • 1
    check out https://github.com/GoogleCloudPlatform/bigquery-utils/tree/master/udfs/community – Mikhail Berlyant Feb 01 '22 at 04:09
  • I think my issue with the `function not found` is that the processing location for my bigquery is us-west2. when the processing zone is US, the function runs just fine. – Canovice Feb 01 '22 at 04:35
  • per https://github.com/GoogleCloudPlatform/bigquery-utils/blob/master/udfs/README.md, it says that `ll UDFs within this repository are available under the bqutil project on publicly shared datasets. Queries can then reference the shared UDFs via bqutil..().` I wonder if this means I cannot use the bqutil functions with tables in my own non-public dataset? – Canovice Feb 01 '22 at 04:47
  • I spun this convo into its own question which is more clearly defined - https://stackoverflow.com/questions/70935769/cannot-use-bigquery-udf-bqutil-in-processing-location-us-west-2 – Canovice Feb 01 '22 at 05:01