0

I have a table with people and their hometown names, but there are same cities with different written, see:

Name Hometown
João São Paulo
Maria Sao Paul
Pedro São Paulo.
Maria S. Paulo

And I need to process this in order to formalize that data to be like this:

Name Hometown
João São Paulo
Maria São Paulo
Pedro São Paulo
Maria São Paulo
  • The dataset has more than 2400 distinct values so I can't hard code.
  • I have a Country table dimension with all cities and their correct names.

I tried this stack and would it be exactly what I need but does not work with my entire dataset.

luisvenezian
  • 441
  • 5
  • 18

2 Answers2

2

Consider below approach (considering you have lookup table with all proper cities names) for purpose of example - I have it as CTE with just few ones

with cities as (
  select 'São Paulo' as city union all 
  select 'Los Angeles' union all 
  select 'Dnipro' union all 
  select 'Kyiv'
)
select Name, City as Hometown
from your_table 
left join cities 
on soundex(Hometown) = soundex(city)      

if applied to sample data in your question - output is

enter image description here

Note: you obviously need to take care of potential duplication in case if some cities sounds similar, in this case adding country constraints might help ...

Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
  • `soundex` is not an appropriate "sounds like" algorithm, especially for non-English words. It only takes into account the first four consonants. There will be many, many false positives. – Schwern Mar 15 '22 at 23:24
  • this is just an option to try. meantime, as OP mentioned , he has table with all correct city names so there should be a match. as of false-positives - I mentioned in the note (at the bottom of my answer) that there will be dups that still needs to be addressed. :o) if country field exists - it can help. also dups can be further eliminated using different (for example distance) algo, etc. – Mikhail Berlyant Mar 15 '22 at 23:27
  • @MikhailBerlyant I appreciate your thought and tried it! unfortunately I had a lot of duplications for apparently two main reasons: a lot of cities sounds very similar and also as Schwern said, seems like it does not works well with non-English words and even the ones which does not sound similar returns the same result with `soundex`, see `select city, soundex(city) from unnest(['Barcelos','Brasileira','Brasília','Brejolândia','Barros Cassal','Berizal','Brejo Alegre']) as city` – luisvenezian Mar 16 '22 at 03:13
  • See notes in my answer - at the very bottom, as well as comments above on use additional algos like distance for example to resolve dups. You can find plenty examples here on SO including answered by me – Mikhail Berlyant Mar 16 '22 at 03:14
  • 1
    @MikhailBerlyant I will enrich my data to use state constraints and see the improvements, it would be great if i could solve this entirely by using only BigQuery – luisvenezian Mar 16 '22 at 03:29
2

First, the basics.

  1. Strip non-letters.
  2. Case fold.
  3. Convert to ASCII equivalents.

The first one is straight-forward, strip out everything which isn't a letter so São Paulo and São Paulo. are both SãoPaulo.

Case folding is also straight-forward, change everything to lower or upper case. são paulo and São Paulo compare the same.

Finally, convert them to the normal ASCII equivalents. For example, são becomes sao.

With this normalization done, the issues of spaces, extra characters, accents, and cases are taken are of. I would recommend doing this outside of BigQuery and in a language like Python. Do a select distinct and transform and compare each value using libraries such as unidecode.


You can then employ some heuristics to try and find "close enough" matches. One example is the Levenshtein distance which is the number of substitutions, insertions, and deletions one needs to do to turn one string into another. Python has a Levenshtein library.

For example, Sao Paul and Sao Paulo have a Levenshtein distance of one; add one letter. S Paulo and Sao Paulo have a Levenshtein distance of two, add two letters. Sao Paulo and Saint Paul have a Levenshtein distance of four; change o to i, add n and t, remove o.

Again, I'd recommend doing this with a regular programming language and then writing the normalized results back to BigQuery.

Schwern
  • 153,029
  • 25
  • 195
  • 336
  • You supplied great content and explanation and for its reason I upvoted your answer but ended up accepting Mikhael one because it had satisfactory results when added constraints and I could solve all by using BigQuery. I just want to register my gratefulness. – luisvenezian Mar 16 '22 at 20:05