0

I have two tables (main_table, codes_table). I need to join the two tables on column code. Example data below:

main_table

id customer code
1 Fred Acme Residential
2 Sue Acme Business
3 Bud Acme & Old State

codes_table

id code group
1 Acme X19CD
2 Acme Business G933C

My data is in Google Big Query and I am hoping to use native SQL. I am trying to come up with SQL that would allow me to join main_table to codes_table such that I would get the following output:

Results

cust_id customer code group
1 Fred Acme Residential X19CD
2 Sue Acme Business G933C
3 Bud Acme & Old State X19CD

Appreciate any thoughts on how one can accomplish this with SQL.

Essentially the code in the customer table could be various permutations of "Acme ". The issue I have had that trying the various ways I have to join to two tables ends up getting both code_table rows as they both start with "Acme". What I am trying to do is join the two where the main_table.code matches the most characters from the code_table.code.

2 Answers2

0

An option of course is the LIKE operator but to avoid the scenario you describe of multiple matches you'd also need to partition and filter the join condition (also known as windowing) in order to only join with the most LIKE match. I.E. include in your join a partition over code where partition_id = 1 to only grab the topmost value. Lot's of examples of windowing out there.

https://cloud.google.com/bigquery/docs/reference/standard-sql/window-function-calls#syntax

Another fun one to try might be to use the soundex() function which BigQuery does support. This creates a code for words that sound the same. You'd then join on the soundex() result instead of the actual codes. But my suspicion is that this won't be accurate enough and you'll get a bunch of false positives as well as potential multiple results which would require windowing as above.

https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#soundex

My true recommendation over either of those options would instead be either a data cleanup effort to standardize your codes. That's painful, but it's only going to get worse the longer you ignore it. OR, build a mapping table that essentially links records with the "version" they should join to. Then you'd always use that map table as a many-to-many within your join.

Lot's of workarounds, but only one true solution. Fix your data quality.

RThomas
  • 10,702
  • 2
  • 48
  • 61
  • I was able to get it to work using the windowing function. I first joined based on the codes_table joining to the main_table on the value that was the length of codes_table (this caused duplicates for 'Acme Business' since it was similar. Then I did some windowing to rank the matched values and finally I filtered it to take the highest match. So windowing did the trick for me. It isn't perfect as there are conditions one could have where you would get a false positive in the matching, but it worked for me. Ultimately source data clean up would be ideal, but this was the next best thing. – J Schuh-Dodge Oct 28 '22 at 12:55
  • Next couple comments are the code I used to prove the logic out. Sorry about the formatting, still figuring out how to do that in Stack Overflow. – J Schuh-Dodge Oct 28 '22 at 13:30
  • with main_table as ( SELECT 1 as ID, 'Fred' as CUSTOMER, 'Acme Residential' as CODE UNION ALL SELECT 2 as ID, 'Sue' as CUSTOMER, 'Acme Business' as CODE UNION ALL SELECT 3 as ID, 'Bud' as CUSTOMER, 'Acme & Old State' as CODE ), codes_table as ( SELECT 1 as ID, 'Acme' as CODE, 'X19CD' as GROUP_CODE UNION ALL SELECT 2 as ID, 'Acme Business' as CODE, 'G933C' as GROUP_CODE ), – J Schuh-Dodge Oct 28 '22 at 13:31
  • result_table_wCalcs as ( SELECT *, LENGTH(c.CODE)/LENGTH(m.CODE) as MATCHPERCENT, MAX(LENGTH(c.CODE)/LENGTH(m.CODE)) OVER (PARTITION BY m.CUSTOMER) AS FINDERSKEEPERS FROM main_table m LEFT OUTER JOIN codes_table c ON c.CODE = LEFT(m.CODE, LENGTH(c.CODE)) ) SELECT * FROM result_table_wCalcs WHERE MATCHPERCENT = FINDERSKEEPERS ; – J Schuh-Dodge Oct 28 '22 at 13:31
0

Investigate the LIKE keyword. Investigate the use of the soundex() function in string matching. You may have to write your own function to do such matching. For example, if you want to match to strings on phonetic closeness, you can construct such a clause this way:

WHERE soundex(<fieldname>) = soundex(<input>)

I guess what I'm suggesting is:

WHERE mymatchingfunction(<fieldname>) = mymatchingfunction(<input>)

While this is computationally expensive, and you lose the benefit of indexing, it will deliver the desired results.

I hope this helps.

David Buttrick
  • 230
  • 1
  • 12