84

I am trying to create a 'normalized' copy of a string to help reduce duplicate names in a database. The names contain many international characters (i.e., accented letters), and I want to create a copy with the accents removed.

I did come across the method below, but cannot get it to work. I can't seem to find what the Unicode Hacks plugin is.

  # Utility method that returns an ASCIIfied, downcased, and sanitized string.
  # It relies on the Unicode Hacks plugin by means of String#chars. We assume
  # $KCODE is 'u' in environment.rb. By now we support a wide range of Latin
  # accented letters, based on the Unicode Character Palette bundled in Macs.
  def self.normalize(str)
     n = str.chars.downcase.strip.to_s
     n.gsub!(/[à áâãäåÄÄ?]/u,    'a')
     n.gsub!(/æ/u,                  'ae')
     n.gsub!(/[ÄÄ?]/u,                'd')
     n.gsub!(/[çÄ?ÄÄ?Ä?]/u,          'c')
     n.gsub!(/[èéêëÄ?Ä?Ä?Ä?Ä?]/u, 'e')
     n.gsub!(/Æ?/u,                   'f')
     n.gsub!(/[ÄÄ?ġģ]/u,            'g')
     n.gsub!(/[ĥħ]/,                'h')
     n.gsub!(/[ììíîïīĩĭ]/u,     'i')
     n.gsub!(/[įıijĵ]/u,           'j')
     n.gsub!(/[ķĸ]/u,               'k')
     n.gsub!(/[Å?ľĺļÅ?]/u,         'l')
     n.gsub!(/[ñÅ?Å?Å?Å?Å?]/u,       'n')
     n.gsub!(/[òóôõöøÅÅ?ÅÅ]/u,  'o')
     n.gsub!(/Å?/u,                  'oe')
     n.gsub!(/Ä?/u,                   'q')
     n.gsub!(/[Å?Å?Å?]/u,             'r')
     n.gsub!(/[Å?Å¡Å?ÅÈ?]/u,          's')
     n.gsub!(/[ťţŧÈ?]/u,           't')
     n.gsub!(/[ùúûüūůűŭũų]/u,'u')
     n.gsub!(/ŵ/u,                   'w')
     n.gsub!(/[ýÿŷ]/u,             'y')
     n.gsub!(/[žżź]/u,             'z')
     n.gsub!(/\s+/,                   ' ')
     n.gsub!(/[^\sa-z0-9_-]/,          '')
     n
  end

Do I need to 'require' a particular library/gem? Or maybe someone could recommend another way to go about this.

I am not using Rails, nor do I plan on doing so.

paradoja
Gus Shortz
  • 1
    Which ruby version are you using? – Huluk Mar 28 '13 at 16:21
  • Take a look at http://stackoverflow.com/questions/1268289/how-to-get-rid-of-non-ascii-characters-in-ruby – MurifoX Mar 28 '13 at 16:28
  • 3
    you could also look at: https://github.com/norman/unidecoder – amalrik maia Mar 28 '13 at 16:34
  • I'm using Ruby 1.9.3. I'll take a look at both of those possible solutions. All I need is the above method's replacement of the listed characters, so if those solutions can do that, great, and thanks :) – Gus Shortz Mar 28 '13 at 20:30
  • I finally found some references to the Unicode Hacks plugin (http://www.railslodge.com/plugins/316-unicode-hacks), which provides the `chars` method needed for the `normalize` method I mentioned. But it seems to no longer be supported. – Gus Shortz Mar 29 '13 at 01:49

5 Answers

251

I generally use I18n to handle this:

1.9.3p392 :001 > require "i18n"
 => true
1.9.3p392 :002 > I18n.transliterate("Hé les mecs!")
 => "He les mecs!"
user2398029
  • 3
    [The documentation](http://api.rubyonrails.org/classes/ActiveSupport/Inflector.html#method-i-transliterate). Being able to set transliterations on a per-locale basis is also very powerful. – Paul Fioravanti Mar 29 '13 at 10:54
  • 13
    This may not do what you expect on characters that don't have basic Latin mappings--for example Chinese characters. It just turns them to question marks. `(main)> I18n.transliterate("雙屬性集合之空間分群演算法-應用於地理資料")` `=> "?????????????-???????"` – David Mar 25 '14 at 18:20
  • 20
    Just a note for plain Ruby: if `I18n::InvalidLocale: :en is not a valid locale` is raised, set `I18n.available_locales = [:en]` before calling `I18n.transliterate`. – Alter Lagos Jul 15 '15 at 04:09
  • 1
    Note: This does not work for everything. Example "Bùi Viện" gets translated to "Bui Vi?n" – CHawk Apr 17 '16 at 13:31
  • 3
    Didn't work for me: `(main)> I18n.transliterate "ŠKODA" => "ŠKODA"` – Michael Jul 12 '16 at 14:30
  • Those cases should be reported as I18n bugs. – user2398029 Jul 22 '16 at 00:46
  • It depends too much on configuration, I think. It does not work for me either; I tried specifying different locales. – kolen May 04 '18 at 17:06
35

The `parameterize` method (from ActiveSupport) could be a nice and simple way to remove special characters so that the string can be used as a human-readable identifier:

> "Françoise Isaïe".parameterize
=> "francoise-isaie"
AlexGuti
20

So far the following is the only way I've been able to accomplish what I need:

str.tr(
"ÀÁÂÃÄÅàáâãäåĀāĂ㥹ÇçĆćĈĉĊċČčÐðĎďĐđÈÉÊËèéêëĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħÌÍÎÏìíîïĨĩĪīĬĭĮįİıĴĵĶķĸĹĺĻļĽľĿŀŁłÑñŃńŅņŇňʼnŊŋÒÓÔÕÖØòóôõöøŌōŎŏŐőŔŕŖŗŘřŚśŜŝŞşŠšſŢţŤťŦŧÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųŴŵÝýÿŶŷŸŹźŻżŽž",
"AAAAAAaaaaaaAaAaAaCcCcCcCcCcDdDdDdEEEEeeeeEeEeEeEeEeGgGgGgGgHhHhIIIIiiiiIiIiIiIiIiJjKkkLlLlLlLlLlNnNnNnNnnNnOOOOOOooooooOoOoOoRrRrRrSsSsSsSssTtTtTtUUUUuuuuUuUuUuUuUuUuWwYyyYyYZzZzZz")

But using this feels very 'hackish', and I would love to find a better way.
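
One limitation of `tr` is that it maps exactly one character to one character, so it cannot turn `æ` into `ae`. A minimal sketch of a hash-based alternative follows; the `TRANSLITERATIONS` table and the `transliterate` name are hypothetical and deliberately tiny, and the table would need to be extended for real data:

```ruby
# Hypothetical lookup table: keys are single characters, values may be
# longer than one character (which `tr` cannot express).
TRANSLITERATIONS = {
  "æ" => "ae", "œ" => "oe", "ß" => "ss",
  "ø" => "o",  "ł" => "l",  "đ" => "d"
}.freeze

# One regexp that matches any key of the table.
TRANSLITERATION_PATTERN = Regexp.union(TRANSLITERATIONS.keys)

# Replace every matched character with its entry in the table.
# String#gsub accepts a Hash as the replacement argument.
def transliterate(str)
  str.gsub(TRANSLITERATION_PATTERN, TRANSLITERATIONS)
end

transliterate("smørrebrød") # => "smorrebrod"
transliterate("œuvre")      # => "oeuvre"
```

This could be applied before or after the `tr` call above, since the two tables do not have to overlap.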

Gus Shortz
  • 1
    This works only for ISO-8859-1. What makes you think it works for UTF-8? – pts Nov 29 '14 at 19:58
  • 4
    This one works for UTF-8 and Ruby 2.2.3, and does exactly what I needed. It lacks some Romanian characters, though, so I've added them: `string.tr( "ÀÁÂÃÄÅàáâãäåĀāĂ㥹ÇçĆćĈĉĊċČčÐðĎďĐđÈÉÊËèéêëĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħÌÍÎÏìíîïĨĩĪīĬĭĮįİıĴĵĶķĸĹĺĻļĽľĿŀŁłÑñŃńŅņŇňʼnŊŋÒÓÔÕÖØòóôõöøŌōŎŏŐőŔŕŖŗŘřŚśŜŝŞşŠšȘșſŢţŤťŦŧȚțÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųŴŵÝýÿŶŷŸŹźŻżŽž", "AAAAAAaaaaaaAaAaAaCcCcCcCcCcDdDdDdEEEEeeeeEeEeEeEeEeGgGgGgGgHhHhIIIIiiiiIiIiIiIiIiJjKkkLlLlLlLlLlNnNnNnNnnNnOOOOOOooooooOoOoOoRrRrRrSsSsSsSsSssTtTtTtTtUUUUuuuuUuUuUuUuUuUuWwYyyYyYZzZzZz")` – Alexander Jun 24 '17 at 09:21
  • Thanks, it worked. It lacked some Vietnamese characters, so I've added them: `tr("ÀÁÂÃÄÅàáâãäåĀāĂ㥹ạảÇçĆćĈĉĊċČčÐðĎďĐđÈÉÊËèéêểệễëĒēĔĕĖėĘęĚěẹĜĝĞğĠġĢģĤĥĦħÌÍÎÏìíîïĨĩĪīĬĭĮįİıịỉĴĵĶķĸĹĺĻļĽľĿŀŁłÑñŃńŅņŇňʼnŊŋÒÓÔÕÖØòóôộỗổõöøŌōŎŏŐőọỏơởợỡŔŕŖŗŘřŚśŜŝŞşŠšſŢţŤťŦŧÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųụưủửữựŴŵÝýÿŶŷŸŹźŻżŽžứừửựữốồộỗổờóợỏỡếềễểệẩẫấầậỳỹýỷỵặẵẳằắ", "AAAAAAaaaaaaAaAaAaaaCcCcCcCcCcDdDdDdEEEEeeeeeeEeEeEeEeEeeGgGgGgGgHhHhIIIIiiiiIiIiIiIiIiiiJjKkkLlLlLlLlLlNnNnNnNnnNnOOOOOOoooooooooOoOoOoooooooRrRrRrSsSsSsSssTtTtTtUUUUuuuuUuUuUuUuUuUuuuuuuuWwYyyYyYZzZzZzuuuuuooooooooooeeeeeaaaaayyyyyaaaaa")` – duyetpt Jul 16 '21 at 08:14
7

Solution:

DIACRITICS = [*0x1DC0..0x1DFF, *0x0300..0x036F, *0xFE20..0xFE2F].pack('U*')

def removeaccents(str)
  str
    .unicode_normalize(:nfd)
    .tr(DIACRITICS, '')
    .unicode_normalize(:nfc)
end

Example (before/after):

ÀÁÂÃÄÅàáâãäåĀāĂ㥹ạảÇçĆćĈĉĊċČčĎďÈÉÊËèéêểệễëĒēĔĕĖėĘęĚěẹĜĝĞğĠġĢģĤĥÌÍÎÏìíîïĨĩĪīĬĭĮįİıịỉĴĵĶķĸĹĺĻļĽľÑñŃńŅņŇňÒÓÔÕÖòóôộỗổõöŌōŎŏŐőọỏơởợỡŔŕŖŗŘřŚśŜŝŞşŠšſŢţŤťÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųụưủửữựŴŵÝýÿŶŷŸŹźŻżŽžứừửựữốồộỗổờóợỏỡếềễểệẩẫấầậỳỹýỷỵặẵẳằắ
AAAAAAaaaaaaAaAaAaaaCcCcCcCcCcDdEEEEeeeeeeeEeEeEeEeEeeGgGgGgGgHhIIIIiiiiIiIiIiIiIıiiJjKkĸLlLlLlNnNnNnNnOOOOOooooooooOoOoOoooooooRrRrRrSsSsSsSsſTtTtUUUUuuuuUuUuUuUuUuUuuuuuuuWwYyyYyYZzZzZzuuuuuooooooooooeeeeeaaaaayyyyyaaaaa

Explanations:

  • Decompose the single-codepoint characters into their constituent codepoints (where applicable).
  • Remove the diacritical mark codepoints (Unicode 15.0.0 reference) found in the following blocks:
    • Combining Diacritical Marks Supplement (U+1DC0 → U+1DFF)
    • Combining Diacritical Marks (U+0300 → U+036F)
    • Combining Half Marks (U+FE20 → U+FE2F)
  • Recompose the characters.
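
To make the three steps concrete, here is a small sketch that traces a single character through them. It is illustrative only: the variable names are hypothetical, and for brevity it strips just the Combining Diacritical Marks block (U+0300 to U+036F) with a regexp rather than using the full `DIACRITICS` string.

```ruby
# "é" stored as one precomposed codepoint (U+00E9).
composed = "\u00E9"

# Step 1: canonical decomposition splits it into "e" plus U+0301
# (COMBINING ACUTE ACCENT).
decomposed = composed.unicode_normalize(:nfd)

# Step 2: drop the combining mark (only one of the three blocks here).
stripped = decomposed.gsub(/[\u0300-\u036F]/, "")

# Step 3: recompose whatever is left.
recomposed = stripped.unicode_normalize(:nfc)

composed.codepoints   # => [233]
decomposed.codepoints # => [101, 769]
recomposed            # => "e"
```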

Caveats:

  • While these diacritics are predominantly used for text, some of them can also be used with symbols. These symbols will see these diacritics removed when they shouldn't be.
  • Obscure codepoints such as subtending marks are not removed. Despite their naming, they are not treated as combining marks by the Unicode reference but as format characters. An example is the Arabic hamza above ◌ٔ (U+0654) that probably doesn't even get properly displayed in your browser.
  • Not a caveat per se but worth noting: diacritics that are preceded by a space or a breaking space are also removed. They are displayed as standalone characters in some text-rendering software so it may be undesired.
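
A short sketch of how this behaves on a few inputs, including some raised in the comments on other answers. The helper name `strip_marks` is hypothetical, and it uses a regexp covering the same three blocks instead of the `tr` call above, so treat it as an approximation of the answer's method rather than the method itself:

```ruby
# The same three blocks as in the answer, written as a character class.
MARKS = /[\u0300-\u036F\u1DC0-\u1DFF\uFE20-\uFE2F]/

# Hypothetical helper: decompose, strip combining marks, recompose.
def strip_marks(str)
  str.unicode_normalize(:nfd).gsub(MARKS, "").unicode_normalize(:nfc)
end

strip_marks("Bùi Viện")  # => "Bui Vien"
strip_marks("ŠKODA")     # => "SKODA"

# Letters such as "Ł" have no canonical decomposition, so they pass
# through unchanged and need a separate mapping if ASCII is required.
strip_marks("Łódź")      # => "Łodz"

# A combining mark that follows a space is removed as well.
strip_marks("a \u0301b") # => "a b"
```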
noname120
  • 1
    I really like this solution. No gems or anything else needed. Just simple and clean code. I hope this gets more votes. IMO, this should be the accepted answer. – luis.madrigal Nov 03 '22 at 06:02
6

If you are using Rails:

"L'Oréal".parameterize(separator: ' ')
Dorian
Navid Khan