Joining two DataFrames on a string column but ignoring accents/diacritics

Question

I have two DataFrames that I want to join on a string column, but one of them stores its names with poor Unicode support that drops accents and other diacritics:

fips = DataFrame(muni=["Adjuntas", "Anasco", "Bayamon", "Mayaguez"], fips=[72001, 72011, 72021, 72097])
pops = DataFrame(muni=["Adjuntas", "Añasco", "Bayamón", "Mayagüez"], pop=[17363, 26161, 169269, 71530])

I want leftjoin(pops, fips; on=:muni) to use an approximate equality that tolerates missing accents and diacritics (while still requiring the base characters to match) or, failing that, some sort of ASCII-ifying string transform that I can apply to pops before joining.

score 1 · Accepted Answer · answered Nov 29 '20 at 17:07

Is this what you want?

julia> using Unicode

julia> leftjoin(transform(pops,
                          :muni =>
                          ByRow(x -> Unicode.normalize(x, stripmark=true)) =>
                          :muni,
                          copycols=false),
                fips, on=:muni)
4×3 DataFrame
 Row │ muni      pop     fips
     │ String    Int64   Int64?
─────┼──────────────────────────
   1 │ Adjuntas   17363   72001
   2 │ Anasco     26161   72011
   3 │ Bayamon   169269   72021
   4 │ Mayaguez   71530   72097
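
Note that the call above overwrites `muni` with the stripped names. If you would rather keep the accented spellings from `pops` in the result, one possible variation is to join on a temporary key column instead. This is only a sketch: the helper `strip_marks` and the column name `:key` are names made up for illustration, not part of DataFrames.jl.

```julia
using DataFrames, Unicode

fips = DataFrame(muni=["Adjuntas", "Anasco", "Bayamon", "Mayaguez"], fips=[72001, 72011, 72021, 72097])
pops = DataFrame(muni=["Adjuntas", "Añasco", "Bayamón", "Mayagüez"], pop=[17363, 26161, 169269, 71530])

# Hypothetical helper: drop diacritical marks, keep the base characters.
strip_marks(s) = Unicode.normalize(s, stripmark=true)

# Add the stripped name as a separate :key column; :muni stays untouched in pops.
pops_keyed = transform(pops, :muni => ByRow(strip_marks) => :key)
# From fips keep only the key and the code, so :muni does not clash in the join.
fips_keyed = select(fips, :muni => ByRow(strip_marks) => :key, :fips)

joined = leftjoin(pops_keyed, fips_keyed, on=:key)
select!(joined, Not(:key))  # drop the temporary key after joining
```

Applying the same transform to both sides also makes the join robust if `fips` ever does contain an accented name.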

This is great, thank you! I had found https://stackoverflow.com/a/37511463/176071 and implemented the transliteration of that normalize |> regex workflow, but had missed the much more direct and succinct `stripmark` normalization option. Thanks! — mbauman, Nov 30 '20 at 18:54
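
For comparison, a normalize-then-regex approach like the one mentioned in the comment above might look roughly as follows in Julia. This is a guess at the general shape, not the commenter's actual code, and `strip_marks_regex` is a made-up name: decompose to NFD, then delete the nonspacing marks (Unicode category Mn).

```julia
using Unicode

# Hypothetical helper: NFD splits "ñ" into "n" plus a combining tilde,
# and the regex then removes every nonspacing mark.
strip_marks_regex(s) = replace(Unicode.normalize(s, :NFD), r"\p{Mn}" => "")

# The built-in option used in the answer does the same job in one call.
strip_marks_builtin(s) = Unicode.normalize(s, stripmark=true)

for name in ["Adjuntas", "Añasco", "Bayamón", "Mayagüez"]
    println(name, " => ", strip_marks_regex(name), " / ", strip_marks_builtin(name))
end
```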
