Ruby method to remove accents from UTF-8 international characters

Question

I am trying to create a 'normalized' copy of a string, to help reduce duplicate names in a database. The names contain many international characters (i.e. accented letters), and I want to create a copy with the accents removed.

I did come across the method below, but cannot get it to work. I can't seem to find what the Unicode Hacks plugin is.

  # Utility method that returns an ASCIIfied, downcased, and sanitized string.
  # It relies on the Unicode Hacks plugin by means of String#chars. We assume
  # $KCODE is 'u' in environment.rb. By now we support a wide range of Latin
  # accented letters, based on the Unicode Character Palette bundled in Macs.
  def self.normalize(str)
     n = str.chars.downcase.strip.to_s
     n.gsub!(/[Ã Ã¡Ã¢Ã£Ã¤Ã¥ÄÄ?]/u,    'a')
     n.gsub!(/æ/u,                   'ae')
     n.gsub!(/[ÄÄ?]/u,                'd')
     n.gsub!(/[Ã§Ä?ÄÄ?Ä?]/u,          'c')
     n.gsub!(/[Ã¨Ã©ÃªÃ«Ä?Ä?Ä?Ä?Ä?]/u, 'e')
     n.gsub!(/Æ?/u,                   'f')
     n.gsub!(/[ÄÄ?Ä¡Ä£]/u,            'g')
     n.gsub!(/[ĥħ]/,                  'h')
     n.gsub!(/[Ã¬Ã¬ÃÃ®Ã¯Ä«Ä©Ä]/u,     'i')
     n.gsub!(/[įıĳĵ]/u,               'j')
     n.gsub!(/[ķĸ]/u,                 'k')
     n.gsub!(/[Å?Ä¾ÄºÄ¼Å?]/u,         'l')
     n.gsub!(/[Ã±Å?Å?Å?Å?Å?]/u,       'n')
     n.gsub!(/[Ã²Ã³Ã´ÃµÃ¶Ã¸ÅÅ?ÅÅ]/u,  'o')
     n.gsub!(/Å?/u,                  'oe')
     n.gsub!(/Ä?/u,                   'q')
     n.gsub!(/[Å?Å?Å?]/u,             'r')
     n.gsub!(/[Å?Å¡Å?ÅÈ?]/u,          's')
     n.gsub!(/[Å¥Å£Å§È?]/u,           't')
     n.gsub!(/[Ã¹ÃºÃ»Ã¼Å«Å¯Å±ÅÅ©Å³]/u,'u')
     n.gsub!(/ŵ/u,                    'w')
     n.gsub!(/[ýÿŷ]/u,                'y')
     n.gsub!(/[žżź]/u,                'z')
     n.gsub!(/\s+/,                   ' ')
     n.gsub!(/[^\sa-z0-9_-]/,          '')
     n
  end

Do I need to 'require' a particular library/gem? Or maybe someone could recommend another way to go about this.

I am not using Rails, nor do I plan on doing so.

Take a look at http://stackoverflow.com/questions/1268289/how-to-get-rid-of-non-ascii-characters-in-ruby — MurifoX, Mar 28 '13 at 16:28
you could also look at: https://github.com/norman/unidecoder — amalrik maia, Mar 28 '13 at 16:34
I'm using Ruby 1.9.3, I'll take a look at both of those possible solutions, all I need is the above method's replacement of the listed characters, so if those solutions can do that great and thanks :) — Gus Shortz, Mar 28 '13 at 20:30
I finally found some references to the Unicode Hack plugin (http://www.railslodge.com/plugins/316-unicode-hacks), that provides the `chars` method needed for the `normalize` method I mentioned. But it seems to no longer be supported — Gus Shortz, Mar 29 '13 at 01:49

user2398029 · Accepted Answer · 2015-02-15T09:33:27.657

251

I generally use I18n to handle this:

1.9.3p392 :001 > require "i18n"
 => true
1.9.3p392 :002 > I18n.transliterate("Hé les mecs!")
 => "He les mecs!"

edited Feb 15 '15 at 09:33

answered Mar 29 '13 at 03:29

user2398029


3

[The documentation](http://api.rubyonrails.org/classes/ActiveSupport/Inflector.html#method-i-transliterate). Being able to set transliterations on a per-locale basis is also very powerful. – Paul Fioravanti Mar 29 '13 at 10:54
13

This may not do what you expect on characters that don't have basic Latin mappings--for example Chinese characters. It just turns them to question marks. `(main)> I18n.transliterate("雙屬性集合之空間分群演算法-應用於地理資料")` `=> "?????????????-???????"` – David Mar 25 '14 at 18:20
20

Just a note for plain Ruby: if `I18n::InvalidLocale: :en is not a valid locale` is thrown, use `I18n.available_locales = [:en]` before `I18n.transliterate` – Alter Lagos Jul 15 '15 at 04:09
1

Note: This does not work for everything. Example "Bùi Viện" gets translated to "Bui Vi?n" – CHawk Apr 17 '16 at 13:31
3

Didn't work for me: `(main)> I18n.transliterate "ŠKODA" => "ŠKODA"` – Michael Jul 12 '16 at 14:30
Those cases should be reported as I18n bugs. – user2398029 Jul 22 '16 at 00:46
It depends too much on configuration, I think. It does not work for me either; I tried specifying different locales. – kolen May 04 '18 at 17:06

score 35 · Answer 2 · answered Aug 06 '18 at 17:19


The `parameterize` method (provided by ActiveSupport) could be a nice and simple solution to remove special characters in order to use the string as a human-readable identifier:

> "Françoise Isaïe".parameterize
=> "francoise-isaie"
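
Outside Rails, a similar result can be roughly approximated with the standard library alone. The following is a minimal sketch, not ActiveSupport's implementation: `slugify` is a hypothetical name, it only strips marks from the Combining Diacritical Marks block (U+0300 to U+036F), and it assumes UTF-8 input on a Ruby version that provides `String#unicode_normalize` and `String#delete_prefix`.

```ruby
# Hypothetical helper: a rough, plain-Ruby approximation of a URL-style slug.
def slugify(str, separator: '-')
  str
    .unicode_normalize(:nfd)          # split "é" into "e" + combining acute accent
    .gsub(/[\u0300-\u036F]/, '')      # drop the combining marks
    .downcase
    .gsub(/[^a-z0-9]+/, separator)    # collapse every other run into one separator
    .delete_prefix(separator)         # trim a leading separator, if any
    .delete_suffix(separator)         # trim a trailing separator, if any
end

slugify('Françoise Isaïe')            # => "francoise-isaie"
slugify("L'Oréal", separator: ' ')    # => "l oreal"
```

Like the behaviour noted in the comments for `parameterize`, this sketch also turns periods into the separator, because anything outside `a-z0-9` is replaced.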

answered Aug 06 '18 at 17:19

AlexGuti


1

They're not using Rails, though. – snowangel Oct 27 '19 at 09:22
2

`parameterize` uses `I18n.transliterate`: https://github.com/rails/rails/blob/main/activesupport/lib/active_support/inflector/transliterate.rb – Dorian Sep 24 '21 at 19:51
thanks man! great thing xD – Wordica Feb 26 '22 at 18:52
note that this changes periods '`.`' into dashes '`-`' – Julien Mar 11 '22 at 23:35

score 20 · Answer 3 · answered Mar 29 '13 at 03:21


So far the following is the only way I've been able to accomplish what I need:

str.tr(
"ÀÁÂÃÄÅàáâãäåĀāĂăĄąÇçĆćĈĉĊċČčÐðĎďĐđÈÉÊËèéêëĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħÌÍÎÏìíîïĨĩĪīĬĭĮįİıĴĵĶķĸĹĺĻļĽľĿŀŁłÑñŃńŅņŇňŉŊŋÒÓÔÕÖØòóôõöøŌōŎŏŐőŔŕŖŗŘřŚśŜŝŞşŠšſŢţŤťŦŧÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųŴŵÝýÿŶŷŸŹźŻżŽž",
"AAAAAAaaaaaaAaAaAaCcCcCcCcCcDdDdDdEEEEeeeeEeEeEeEeEeGgGgGgGgHhHhIIIIiiiiIiIiIiIiIiJjKkkLlLlLlLlLlNnNnNnNnnNnOOOOOOooooooOoOoOoRrRrRrSsSsSsSssTtTtTtUUUUuuuuUuUuUuUuUuUuWwYyyYyYZzZzZz")

But using this feels very 'hackish', and I would love to find a better way.
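
One limitation of `tr` is that it maps exactly one character to one character, so replacements such as 'ae' or 'oe' from the method in the question cannot be expressed with it. A possible alternative is `gsub` with a Hash, sketched below. The table and the names `REPLACEMENTS`, `REPLACEMENTS_PATTERN`, and `replace_special_letters` are illustrative assumptions, not a complete mapping.

```ruby
# Hypothetical mapping table; extend it with whatever characters your data contains.
REPLACEMENTS = {
  'æ' => 'ae', 'Æ' => 'AE',
  'œ' => 'oe', 'Œ' => 'OE',
  'ß' => 'ss',
  'ø' => 'o',  'Ø' => 'O',
  'đ' => 'd',  'Đ' => 'D',
  'ł' => 'l',  'Ł' => 'L'
}.freeze

# A single regexp that matches any key of the table.
REPLACEMENTS_PATTERN = Regexp.union(REPLACEMENTS.keys)

def replace_special_letters(str)
  # With a Hash as the second argument, gsub looks up each match as a key.
  str.gsub(REPLACEMENTS_PATTERN, REPLACEMENTS)
end

replace_special_letters('Søren Æbelø')  # => "Soren AEbelo"
replace_special_letters('straße')       # => "strasse"
```

Characters that are not in the table pass through unchanged, so this can be combined with one of the other approaches in this thread.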

answered Mar 29 '13 at 03:21

Gus Shortz


1

This works only for ISO-8859-1. What makes you think it works for UTF-8? – pts Nov 29 '14 at 19:58
4

This one works for UTF-8 and ruby 2.2.3, and does exactly what I needed. Lacks some Romanian characters though. I've added them: `string.tr( "ÀÁÂÃÄÅàáâãäåĀāĂăĄąÇçĆćĈĉĊċČčÐðĎďĐđÈÉÊËèéêëĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħÌÍÎÏìíîïĨĩĪīĬĭĮįİıĴĵĶķĸĹĺĻļĽľĿŀŁłÑñŃńŅņŇňŉŊŋÒÓÔÕÖØòóôõöøŌōŎŏŐőŔŕŖŗŘřŚśŜŝŞşŠšȘșſŢţŤťŦŧȚțÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųŴŵÝýÿŶŷŸŹźŻżŽž", "AAAAAAaaaaaaAaAaAaCcCcCcCcCcDdDdDdEEEEeeeeEeEeEeEeEeGgGgGgGgHhHhIIIIiiiiIiIiIiIiIiJjKkkLlLlLlLlLlNnNnNnNnnNnOOOOOOooooooOoOoOoRrRrRrSsSsSsSsSssTtTtTtTtUUUUuuuuUuUuUuUuUuUuWwYyyYyYZzZzZz")` – Alexander Jun 24 '17 at 09:21
Thanks, it worked. It lacks some Vietnamese chars. I've added them: `tr("ÀÁÂÃÄÅàáâãäåĀāĂăĄąạảÇçĆćĈĉĊċČčÐðĎďĐđÈÉÊËèéêểệễëĒēĔĕĖėĘęĚěẹĜĝĞğĠġĢģĤĥĦħÌÍÎÏìíîïĨĩĪīĬĭĮįİıịỉĴĵĶķĸĹĺĻļĽľĿŀŁłÑñŃńŅņŇňŉŊŋÒÓÔÕÖØòóôộỗổõöøŌōŎŏŐőọỏơởợỡŔŕŖŗŘřŚśŜŝŞşŠšſŢţŤťŦŧÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųụưủửữựŴŵÝýÿŶŷŸŹźŻżŽžứừửựữốồộỗổờóợỏỡếềễểệẩẫấầậỳỹýỷỵặẵẳằắ", "AAAAAAaaaaaaAaAaAaaaCcCcCcCcCcDdDdDdEEEEeeeeeeEeEeEeEeEeeGgGgGgGgHhHhIIIIiiiiIiIiIiIiIiiiJjKkkLlLlLlLlLlNnNnNnNnnNnOOOOOOoooooooooOoOoOoooooooRrRrRrSsSsSsSssTtTtTtUUUUuuuuUuUuUuUuUuUuuuuuuuWwYyyYyYZzZzZzuuuuuooooooooooeeeeeaaaaayyyyyaaaaa")` – duyetpt Jul 16 '21 at 08:14

noname120 · Answer 4 · 2022-10-11T14:09:30.210

Solution:

DIACRITICS = [*0x1DC0..0x1DFF, *0x0300..0x036F, *0xFE20..0xFE2F].pack('U*')

def removeaccents(str)
  str
    .unicode_normalize(:nfd)
    .tr(DIACRITICS, '')
    .unicode_normalize(:nfc)
end

Example (before/after):

ÀÁÂÃÄÅàáâãäåĀāĂăĄąạảÇçĆćĈĉĊċČčĎďÈÉÊËèéêểệễëĒēĔĕĖėĘęĚěẹĜĝĞğĠġĢģĤĥÌÍÎÏìíîïĨĩĪīĬĭĮįİıịỉĴĵĶķĸĹĺĻļĽľÑñŃńŅņŇňÒÓÔÕÖòóôộỗổõöŌōŎŏŐőọỏơởợỡŔŕŖŗŘřŚśŜŝŞşŠšſŢţŤťÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųụưủửữựŴŵÝýÿŶŷŸŹźŻżŽžứừửựữốồộỗổờóợỏỡếềễểệẩẫấầậỳỹýỷỵặẵẳằắ
AAAAAAaaaaaaAaAaAaaaCcCcCcCcCcDdEEEEeeeeeeeEeEeEeEeEeeGgGgGgGgHhIIIIiiiiIiIiIiIiIıiiJjKkĸLlLlLlNnNnNnNnOOOOOooooooooOoOoOoooooooRrRrRrSsSsSsSsſTtTtUUUUuuuuUuUuUuUuUuUuuuuuuuWwYyyYyYZzZzZzuuuuuooooooooooeeeeeaaaaayyyyyaaaaa

Explanations:

1. Decompose the single-codepoint characters into their constituent codepoints (where applicable).
2. Remove the diacritical mark codepoints (Unicode 15.0.0 reference) found in the following blocks:
   - Combining Diacritical Marks Supplement (U+1DC0 → U+1DFF)
   - Combining Diacritical Marks (U+0300 → U+036F)
   - Combining Half Marks (U+FE20 → U+FE2F)
3. Recompose the characters.
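
To make the decomposition step concrete, the sketch below inspects the codepoints of a single precomposed character before and after NFD normalization. The variable names are illustrative only, and the manual filter covers just the U+0300 to U+036F block for brevity.

```ruby
# Inspect what NFD decomposition does to a precomposed character.
precomposed = "\u00E9"                                  # "é" as one codepoint
decomposed  = precomposed.unicode_normalize(:nfd)

precomposed.codepoints.map { |cp| format('U+%04X', cp) } # => ["U+00E9"]
decomposed.codepoints.map { |cp| format('U+%04X', cp) }  # => ["U+0065", "U+0301"]

# Dropping the combining mark leaves only the base letter.
stripped = decomposed.each_char.reject { |ch| (0x0300..0x036F).cover?(ch.ord) }.join
stripped                                                # => "e"

# Letters without a canonical decomposition are left alone by NFD,
# which is why "ı" survives unchanged in the example output above.
dotless_i = "\u0131"
dotless_i.unicode_normalize(:nfd) == dotless_i          # => true
```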

Caveats:

While these diacritics are predominantly used for text, some of them can also be used with symbols. These symbols will see these diacritics removed when they shouldn't be.
Obscure codepoints such as subtending marks are not removed. Despite their naming, they are not treated as combining marks by the Unicode reference but as format characters. An example is the Arabic hamza above ◌ٔ (U+0654) that probably doesn't even get properly displayed in your browser.
Not a caveat per se but worth noting: diacritics that are preceded by a space or a breaking space are also removed. They are displayed as standalone characters in some text-rendering software so it may be undesired.
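
If the first and last caveats matter for your data, one possible variation is to remove a diacritic only when it directly follows a letter. This is a sketch under that assumption, not part of the solution above: `remove_accents_from_letters` is a hypothetical name, and the character class simply repeats the same three blocks.

```ruby
# Hypothetical variant: strip marks from the same three blocks,
# but only when they are attached to a letter (\p{L}).
def remove_accents_from_letters(str)
  str
    .unicode_normalize(:nfd)
    .gsub(/(\p{L})[\u0300-\u036F\u1DC0-\u1DFF\uFE20-\uFE2F]+/, '\1')
    .unicode_normalize(:nfc)
end

remove_accents_from_letters('Café')         # => "Cafe"

# A mark following a space is kept, since a space is not a letter.
remove_accents_from_letters("a \u0301b")    # => "a \u0301b"

# "≠" decomposes to "=" plus a combining overlay; "=" is not a letter,
# so the symbol is recomposed intact.
remove_accents_from_letters("a \u2260 b")   # => "a ≠ b"
```

The trade-off is that a mark attached to a digit or punctuation character is also left in place, which may or may not be what you want.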

I really like this solution. No gems or anything else needed. Just simple and clean code. I hope this gets more votes. IMO, this should be the accepted answer. — luis.madrigal, Nov 03 '22 at 06:02

score 6 · Answer 5 · edited Sep 24 '21 at 19:50


If you are using Rails:

"L'Oréal".parameterize(separator: ' ')

edited Sep 24 '21 at 19:50

Dorian


answered Mar 21 '20 at 06:14

Navid Khan

