80

I have an ActiveRecord model, Foo, which has a name field. I'd like users to be able to search by name, but I'd like the search to ignore case and any accents. Thus, I'm also storing a canonical_name field against which to search:

class Foo < ActiveRecord::Base
  validates_presence_of :name

  before_validation :set_canonical_name

  private

  def set_canonical_name
    self.canonical_name ||= canonicalize(self.name) if self.name
  end

  def canonicalize(x)
    x.downcase.  # something here
  end
end

I need to fill in the "something here" to replace the accented characters. Is there anything better than

x.downcase.gsub(/[àáâãäå]/,'a').gsub(/æ/,'ae').gsub(/ç/, 'c').gsub(/[èéêë]/,'e')....

And, for that matter, since I'm not on Ruby 1.9, I can't put those Unicode literals in my code. The actual regular expressions will look much uglier.

paradoja
James A. Rosen
  • 2
    even in 1.8 you can, use "ruby -Ku" – Keltia Jan 23 '09 at 19:16
  • 1
    This problem has long since been solved and there are many great comments below. Re-reading it now, I want to make one thing clear: the idea was to create a version of the text that was searchable with just ASCII characters, *not* to actually coerce the data. Note that there are two database properties: `name` and `canonical_name`. I do *not* advocate trashing the actual data, merely creating a way of searching through it without diacritic marks, which users of all languages often leave off. – James A. Rosen Aug 20 '11 at 17:27
  • 1
    Actually, every single of these is the wrong answer. You need to use Unicode Collation Algorithm with a comparison strength set to level 1 only. Everything else is screwed up. – tchrist Jan 15 '12 at 16:19
  • 8
    @tchrist so you showed up to the discussion to say "those guys are wrong" but didn't offer anything more than the barest of answers? o_O Please answer the question for real just so I can downvote you for being obnoxious. – jcollum Jan 22 '12 at 18:55
  • @tchrist "wrong" may depend on individual requirements. True, being wrong may come back to haunt someone who doesn't know the ramifications (and consequently didn't add the requirement they would have added if they knew better). But until they are told said ramifications, they won't heed the suggestion. – Kelvin Mar 08 '13 at 23:32
  • @JamesA.Rosen I believe that you had no intention of converting Swedish/Danish into nonsense. I do get annoyed that some 'Swedish' programmers reference this post as a way to implement search (more easily [for the programmer]), now almost 5 years later. – Jonke Apr 29 '13 at 08:39

15 Answers

101

ActiveSupport::Inflector.transliterate (requires Rails 2.2.1+ and Ruby 1.9 or 1.8.7)

example:

>> ActiveSupport::Inflector.transliterate("àáâãäå").to_s => "aaaaaa"

rogerdpack
Mark Wilden
63

Rails already has a built-in for normalizing: normalize your string to form KD, then remove the remaining non-ASCII characters (i.e. the accent marks), like this:

>> "àáâãäå".mb_chars.normalize(:kd).gsub(/[^\x00-\x7F]/n,'').downcase.to_s
=> "aaaaaa"
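
On Ruby 2.2 or later, the same idea can be tried without ActiveSupport, because String#unicode_normalize is part of core Ruby. The following is only a minimal sketch of that, not the code from this answer; the method name ascii_fold is made up for illustration.

```ruby
# Hypothetical helper (not part of Rails): NFKD-decompose the string, then
# drop every non-ASCII character, mirroring the mb_chars one-liner above.
def ascii_fold(text)
  text.unicode_normalize(:nfkd)   # split "é" into "e" + combining accent
      .gsub(/[^\x00-\x7F]/, '')   # remove everything outside ASCII
      .downcase
end

puts ascii_fold("àáâãäå")
puts ascii_fold("Visual Café")
```

Because everything outside ASCII is deleted, letters that have no decomposition (such as æ or ø) and non-Latin scripts vanish from the result, which may or may not be acceptable for a search key.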
rogerdpack
unexist
  • % ruby -v ruby 1.8.7 (2008-08-11 patchlevel 72) [i686-linux] – unexist Dec 06 '08 at 19:23
  • Thanks, didn't know that functionality existed in Rails. The method name was different in my Rails version: "àáâãäå".mb_chars. –  Jan 14 '09 at 01:15
  • 1
    +1 for form KD, which will also turn ligatures like 'ﬃ' into 'ffi'. – Christian - Reinstate Monica C Feb 24 '09 at 17:33
  • 4
    I'm trying to use this in another script outside a Rails app. I thought it'd be in `activesupport`, but after requiring it I still get a `NoMethodError` for `normalize`. Do you know what I have to require? – agentofuser Oct 30 '09 at 16:28
  • 4
    It is in activesupport, but you will have to do it like this: ActiveSupport::Multibyte::Chars.new("àáâãäå").mb_chars.normalize(:kd).gsub(/[^\x00-\x7F]/n,'').downcase.to_s – unexist Nov 16 '09 at 07:56
  • 7
    This works great, but I had to do `mb_chars` like Christian. `foo.mb_chars.normalize(:kd).gsub(/[^\x00-\x7F]/n,'').to_s.split` – Sam Soffes Nov 24 '09 at 01:03
  • 1
    One more tip: if you get "NoMethodError: undefined method `normalize'", you may also need to explicitly set $KCODE = 'u' to force the default encoding for strings into Unicode. – lambshaanxy Mar 18 '10 at 03:03
  • 51
    At least in Rails3, String#parameterize works ... so "öüâ".parameterize == "oua" – foz Jul 13 '11 at 22:46
  • Nope. You need to use the UCA at level 1 for this problem. – tchrist Jan 15 '12 at 16:19
  • I tested @foz solution and worked like a charm! I must add that parameterize is far cleaner and avoids future annoying warnings, i.e: `"àáâãäå".mb_chars.normalize(:kd).gsub(/[^\x00-\x7F]/n,'').split(/a/)` shows "warning: regexp match /.../n against to UTF-8 string" (it works anyway). On the other hand `"àáâãäå".parameterize.split(/a/)` shows no warning – Redithion Mar 01 '16 at 03:18
  • In essence, the beginning of this answer is what the `transliterate` method does, and if you add in the `downcase`, it's similar to what the `parameterize` method does. – rogerdpack Jun 11 '16 at 12:24
  • parameterize is for urls -> it replaces spaces with `-` so not very useful for general use. – Karthik T Sep 05 '16 at 10:46
  • parameterize(' ') works outside URLs, too; it uses spaces instead of dashes. In my case, I needed to preserve stuff like parentheses etc., so parameterize was not an option. The above mb_chars code resulted in the warning "regexp match /.../n against to UTF-8 string" (which is exactly what the regexp *wants to do* if I understood it correctly), so I also needed Kernel::silence_warnings. Which is ugly, but works. – haslo Sep 28 '16 at 07:51
44

Better yet is to use I18n:

1.9.3-p392 :001 > require "i18n"
 => false
1.9.3-p392 :002 > I18n.transliterate("Olá Mundo!")
 => "Ola Mundo!"
Diego Moreira
  • 1
    in normal ruby (non rails) I get: LoadError: cannot load such file -- i18n rails library? Anyway as a note the rails method ActiveSupport::Inflector.transliterate actually calls the I18n one under the covers (after doing a normalize to make sure it can remove all diacritical marks) – rogerdpack Jun 11 '16 at 12:25
  • 1
    For `cannot load such file -- i18n`, just `sudo gem install i18n`. – Camille Goudeseune Jul 27 '16 at 22:02
  • 2
    I got an error like `:en is not a valid locale (I18n::InvalidLocale)` until I added `I18n.available_locales = [:en]`. This also replaces non-ASCII characters that are not Latin letters with diacritics with question marks, so that for example `ruby -ri18n -ne'I18n.available_locales=[:en];puts I18n.transliterate$_'<<<Дあ☆` prints `???`. – nisetama Aug 12 '16 at 19:25
20

I tried a lot of these approaches, but each one failed to meet one or more of these requirements:

  • Respect spaces
  • Respect the 'ñ' character
  • Respect case (I know this is not a requirement of the original question, but it is not difficult to convert a string to lowercase)

What worked for me is this:

# coding: utf-8
string.tr(
  "ÀÁÂÃÄÅàáâãäåĀāĂăĄąÇçĆćĈĉĊċČčÐðĎďĐđÈÉÊËèéêëĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħÌÍÎÏìíîïĨĩĪīĬĭĮįİıĴĵĶķĸĹĺĻļĽľĿŀŁłÑñŃńŅņŇňŉŊŋÒÓÔÕÖØòóôõöøŌōŎŏŐőŔŕŖŗŘřŚśŜŝŞşŠšſŢţŤťŦŧÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųŴŵÝýÿŶŷŸŹźŻżŽž",
  "AAAAAAaaaaaaAaAaAaCcCcCcCcCcDdDdDdEEEEeeeeEeEeEeEeEeGgGgGgGgHhHhIIIIiiiiIiIiIiIiIiJjKkkLlLlLlLlLlNnNnNnNnnNnOOOOOOooooooOoOoOoRrRrRrSsSsSsSssTtTtTtUUUUuuuuUuUuUuUuUuUuWwYyyYyYZzZzZz"
)

http://blog.slashpoundbang.com/post/12938588984/remove-all-accents-and-diacritics-from-string-in-ruby

You have to modify the character list slightly to respect the 'ñ' character, but that is an easy job.
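
One way to keep the two tr arguments aligned while leaving 'ñ' alone is to generate them from a small table. This is a sketch under assumptions: the constant and method names are invented for illustration, and the table covers only a few vowels rather than the full list above.

```ruby
# Hypothetical table: ASCII replacement => accented characters it stands for.
# "ñ" and "Ñ" are deliberately left out so that they survive untouched.
ACCENT_GROUPS = {
  "A" => "ÀÁÂÃÄÅ", "a" => "àáâãäå",
  "E" => "ÈÉÊË",   "e" => "èéêë",
  "I" => "ÌÍÎÏ",   "i" => "ìíîï",
  "O" => "ÒÓÔÕÖ",  "o" => "òóôõö",
  "U" => "ÙÚÛÜ",   "u" => "ùúûü"
}.freeze

# Build both tr arguments from the table so they always have equal length.
TR_FROM = ACCENT_GROUPS.values.join
TR_TO   = ACCENT_GROUPS.map { |ascii, accented| ascii * accented.length }.join

def remove_accents_keeping_enye(text)
  text.tr(TR_FROM, TR_TO)
end

puts remove_accents_keeping_enye("El Niño está aquí")
```

Generating the replacement string avoids the main hazard of the hand-written version: a single missing or extra character shifts every mapping after it.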

fguillen
  • Can you elaborate on what you mean about having to modify the character list to respect the character `ñ`? It seems to me that it is already in the list and aligned with `n`. – user664833 Sep 11 '13 at 16:26
  • To respect the `ñ` character I mean to NOT transform it into `n` character but keep it. – fguillen Sep 11 '13 at 16:43
  • I see. Can you say why this character is special (I mean, why it is singled out for respect)? – user664833 Sep 11 '13 at 23:33
  • Could be a lot of reasons, the most common is that it is not an ASCII character. – fguillen Sep 12 '13 at 10:45
  • 1
    Sorry, but I still don't understand why you chose to single out `ñ` in your bulleted list of requirements. `ñ` is the "Latin small letter n with tilde", and it is in the extended ASCII set, along with many of the others in your list - see http://www.ascii-code.com/ - whereas there are a number of characters in your list that are *not* in the extended ASCII set, including `Ą` and `Ħ`. So I am still confused as to why you singled out `ñ`. – user664833 Sep 12 '13 at 17:48
  • 1
    `ActiveSupport::Inflector.transliterate` seems to satisfy your requirements except for the "retaining ñ" and also this way is pure ruby which is nice. Unfortunately with unicode you can do weird stuff like add an umlaut over basically *any* preceding char, so this approach will be hard to get comprehensive enough to meet all situations :| – rogerdpack Jun 11 '16 at 12:14
  • Also, there are multiple different ways to have an 'ä', for example. Just bit me right now. `'ä'.ord => 97`, and `'ä'.ord => 228`. Won't work with copy-paste because StackOverflow does the sensible thing and normalizes them. – haslo Sep 27 '16 at 16:44
  • -1: Respecting just ñ is a too narrow requirement: The question is about transliterating *all* accented characters, not just those that you happen to know about personally. Actually Unicode gains new characters all the time, so even if your list is complete today, it may fail with the next Unicode release. – toolforger Jan 11 '20 at 11:42
13

My answer: the String#parameterize method:

"Le cœur de la crémiére".parameterize
=> "le-coeur-de-la-cremiere"

For non-Rails programs:

Install activesupport: gem install activesupport then:

require 'active_support/inflector'

"a&]'s--3\014\xC2àáâã3D".parameterize
# => "a-s-3-3d"
rogerdpack
Dorian
  • There is a vast difference with . parameterize versus ActiveSupport::Inflector.transliterate Input: " Don't Fall Into This Programming Trap" .parameterize gives: "don-t-fall-into-this-programming-trap" ActiveSupport::Inflector.transliterate gives: "? Don't Fall Into This Programming Trap" That's a huge, huge difference. – fuzzygroup Jul 11 '17 at 12:50
  • @fuzzygroup Using the code formatting of markdown (e.g. `\`method\``) helps for the reading part of the comment. And to answer your question, `"Le cœur de la crémiére".parameterize` is the best UTF-8 to ASCII for urls, it's suuper nice and sweet – Dorian Jul 13 '17 at 01:55
8

Decompose the string and remove non-spacing marks from it.

irb -ractive_support/all
> "àáâãäå".mb_chars.normalize(:kd).gsub(/\p{Mn}/, '')
aaaaaa

You may also need this if used in a .rb file.

# coding: utf-8

The normalize(:kd) part splits out diacritics where possible (e.g. the single character "n with tilde" is split into an n followed by a combining tilde character), and the gsub part then removes all the combining marks.

rogerdpack
Cheng
  • See also unexist's answer here, which in essence does this but with a 1.8.x compatible regex. – rogerdpack Jun 11 '16 at 12:16
  • 3
    This should be way higher. Other solutions completely strip other character sets (e.g. `I18n.transliterate('日本語') #=> "???"`) and `'日本語'.parameterize #=> ""`. This answer is the closest fit to my needs, which is to be able to approximately match diverse datasets on titles/authors. `'日本語 àáâãäå'.unicode_normalize(:nfkd).gsub(/\p{Mn}/, '') #=> "日本語 aaaaaa"` – Bo Jeanes May 19 '17 at 02:04
7

I think that maybe you don't really want to go down that path. If you are developing for a market that uses these kinds of letters, your users will probably think you are some sort of ...pip, because 'å' isn't even close to 'a' in any sense to such a user. Take a different road and read up on searching in a non-ASCII way. This is exactly the kind of case for which Unicode and collation were invented.

A very late PS:

http://www.w3.org/International/wiki/Case_folding http://www.w3.org/TR/charmod-norm/#sec-WhyNormalization

Besides that, I have no idea why the link to collation goes to an MSDN page, but I'll leave it there. It should have been http://www.unicode.org/reports/tr10/

Jonke
  • I'm all for database collation, but someone might switch databases a year after I leave; I'd prefer to be defensive and at least do it in code, and possibly also in the DB. As for forcing the users to type what they mean: how many English users type résumé? Or "visual café"? – James A. Rosen Nov 17 '08 at 23:40
  • In the strip, the poster has the letters å and ä. If you reduce those to a, the words they appear in become meaningless. You can't strip those and use what is left. If you really work for a European market, you had better learn to search with something, instead of trashing the users' data. – Jonke Nov 18 '08 at 07:24
  • In Slovak language, for example á, ä is very close to a. And so are all accented characters to those without accent. Lots of people don't use these at all in IM, etc. – Vojto Jul 02 '10 at 07:52
  • @Vojto: In most northern European languages, accented characters are far away from the unaccented versions. In fact, they are symbols for very different sounds. The German öl, for example (http://en.bab.la/dictionary/german-english/oel). Or the Swedish words ål (eel) and al (a tree). – Jonke Jul 02 '10 at 08:22
  • Cool, I just wanted to note, that that's not necessarily true for all European languages. I mentioned Slovak, but it's the same also for Czech, Polish, Croatian I guess and pretty much all Slavic languages. And it's very important that search engines, etc. support searching by unaccented characters - because in most cases people are just too lazy to type accents. – Vojto Jul 03 '10 at 20:09
  • You need to use a comparison based on the UCA for this, and at level 1 only, plus you need to use the UCA tailoring for the working locale if you want things to compare the way people in that locale are expecting if it differs from the standard collation. – tchrist Jan 15 '12 at 16:21
  • @tchrist I (naively) thought that was precisely what I suggested. – Jonke Apr 29 '13 at 08:46
4

This assumes you use Rails.

"anything".parameterize.underscore.humanize.downcase

Given your requirements, this is probably what I'd do... I think it's neat, simple and will stay up to date in future versions of Rails and Ruby.

Update: dgilperez pointed out that parameterize takes a separator argument, so "anything".parameterize(" ") (deprecated) or "anything".parameterize(separator: " ") is shorter and cleaner.

Sudhir Jonathan
3

Convert the text to normalization form D, remove all codepoints with unicode category non spacing mark (Mn), and convert it back to normalization form C. This will strip all diacritics, and your problem is reduced to a case insensitive search.

See http://www.siao2.com/2005/02/19/376617.aspx and http://www.siao2.com/2007/05/14/2629747.aspx for details.
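
The three steps described above can be sketched in plain Ruby (2.2 or later) with String#unicode_normalize. The method names here are invented for illustration; this is one possible reading of the recipe, not a definitive implementation.

```ruby
# Hypothetical helper following the recipe: form D, drop Mn, back to form C.
def strip_diacritics(text)
  text.unicode_normalize(:nfd)   # form D: base letters + combining marks
      .gsub(/\p{Mn}/, '')        # remove non-spacing marks (category Mn)
      .unicode_normalize(:nfc)   # recompose whatever is left into form C
end

# With the diacritics gone, the comparison is a plain case-insensitive one.
def same_ignoring_accents_and_case?(left, right)
  strip_diacritics(left).downcase == strip_diacritics(right).downcase
end

puts strip_diacritics("Visual Café")
puts same_ignoring_accents_and_case?("RÉSUMÉ", "resume")
```

Unlike stripping everything outside ASCII, this leaves characters without combining marks in place, so text in other scripts is kept rather than deleted.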

Community
CesarB
3

The key is to use two columns in your database: canonical_text and original_text. Use original_text for display and canonical_text for searches. That way, if a user searches for "Visual Cafe," she sees the "Visual Café" result. If she really wants a different item called "Visual Cafe," it can be saved separately.
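
The two-column idea can be illustrated without a database. In this sketch the Item struct, build_item, canonicalize and search are hypothetical stand-ins for the model, its callback and its query; a real application would store canonical_text in the table and search it with SQL.

```ruby
# Hypothetical in-memory stand-in for a table with two text columns.
Item = Struct.new(:original_text, :canonical_text)

# Assumed canonical form: diacritics stripped, lower-cased.
def canonicalize(text)
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, '').downcase
end

# Fill in both "columns" when a record is created.
def build_item(text)
  Item.new(text, canonicalize(text))
end

# Canonicalize the query the same way, match on canonical_text,
# and return the items so the caller can display original_text.
def search(items, query)
  wanted = canonicalize(query)
  items.select { |item| item.canonical_text.include?(wanted) }
end

catalog = ["Visual Café", "Visual Basic"].map { |title| build_item(title) }
puts search(catalog, "visual cafe").map(&:original_text)
```

The important design point is that the query goes through the same canonicalize step as the stored data; otherwise a user who does type the accent would get no match.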

To get the canonical_text characters in a Ruby 1.8 source file, do something like this:

register_replacement([0x008A].pack('U'), 'S')
rogerdpack
James A. Rosen
2

You probably want Unicode decomposition ("NFD"). After decomposing the string, just filter out anything not in [A-Za-z]. ã will decompose to "a~" (approximately - the diacritical will become a separate character), so the filtering leaves a reasonable approximation. Note that æ has no Unicode decomposition, so it needs an explicit replacement with "ae" before filtering.
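
To see what decomposition actually produces, the code points can be printed. This is a small sketch for Ruby 2.2 or later; nfd_codepoints and letters_only are hypothetical helper names. Note that in this sketch æ comes back unchanged, because Unicode defines no decomposition for it, so the [A-Za-z] filter drops it rather than turning it into "ae".

```ruby
# Hypothetical helper: list the code points of the NFD form of a string.
def nfd_codepoints(text)
  text.unicode_normalize(:nfd).codepoints.map { |cp| format("U+%04X", cp) }
end

# Hypothetical helper: decompose, then keep only the letters A-Z and a-z.
def letters_only(text)
  text.unicode_normalize(:nfd).gsub(/[^A-Za-z]/, '')
end

p nfd_codepoints("ã")   # base letter followed by a combining tilde
p nfd_codepoints("æ")   # a single code point: nothing to split
p letters_only("São Paulo")
```

Because the filter keeps letters only, spaces and digits are removed as well; widen the character class if they matter for the search key.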

MSalters
  • 1
    All these answers involving normalization forms are every one of them wrong. You need a UCA level-1 comparison, possibly with locale tailoring. – tchrist Jan 15 '12 at 16:22
  • 1
    @tchrist: If you want to give an alternative answer, feel free. If you want to point out why my answer doesn't work, use can use a comment, but then at least point out _why_ it doesn't work. (Hint: read the title of the question first; UCA comparison does **not** replace accented characters). – MSalters Jan 16 '12 at 09:07
  • See also Cheng's answer for an example – rogerdpack Jun 11 '16 at 11:35
1

iconv:

http://groups.google.com/group/ruby-talk-google/browse_frm/thread/8064dcac15d688ce?

=============

a perl module which i can't understand:

http://www.ahinea.com/en/tech/accented-translate.html

============

brute force (there's a lot of those critters!):

http://projects.jkraemer.net/acts_as_ferret/wiki#UTF-8support

http://snippets.dzone.com/posts/show/2384

Gene T
0

I had problems getting the foo.mb_chars.normalize(:kd).gsub(/[^\x00-\x7F]/n,'').downcase.to_s solution to work. I'm not using Rails and there was some conflict with my activesupport/ruby versions that I couldn't get to the bottom of.

Using the ruby-unf gem seems to be a good substitute:

require 'unf'
foo.to_nfd.gsub(/[^\x00-\x7F]/n,'').downcase

As far as I can tell this does the same thing as .mb_chars.normalize(:kd). Is this correct? Thanks!

0

If you are using PostgreSQL >= 9.4 as your database, you could enable its "unaccent" extension in a migration, which I think does what you want, like this:

def self.up
   enable_extension "unaccent" # Does not fail if it already exists
end

In order to test, in the console:

2.3.1 :045 > ActiveRecord::Base.connection.execute("SELECT unaccent('unaccent', 'àáâãäåÁÄ')").first
 => {"unaccent"=>"aaaaaaAA"}

Note that the result is still case sensitive at this point.

Then, maybe use it in a scope, like:

scope :with_canonical_name, ->(name) {
   # Bind the value instead of interpolating it, to avoid SQL injection
   where("unaccent(foos.name) ILIKE unaccent(?)", name)
}

The ILIKE operator makes the search case insensitive. Another approach is to use the citext data type. Here is a discussion of these two approaches. Note also that using PostgreSQL's lower() function is not recommended.

This will save you some DB space, since you no longer need the canonical_name field, and may make your model simpler, at the cost of some extra processing in each query; how much depends on whether you use ILIKE or citext, and on your dataset.

If you are using MySQL maybe you can use this simple solution, but I have not tested it.

user2553863
-3

lol.. I just tried this and it works, though I'm still not sure why. When I use these 4 lines of code:

str = str.gsub(/[^a-zA-Z0-9 ]/,"")
str = str.gsub(/[ ]+/," ")
str = str.gsub(/ /,"-")
str = str.downcase

it automatically removes any accented characters from filenames, which is what I was trying to do (strip accents from filenames and then rename them). Hope it helped :)
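
A caveat on the four lines above: the first gsub deletes every character outside a-z, A-Z, 0-9 and space, so an accented letter is dropped entirely rather than replaced by its base letter (for example "Crème" ends up as "crme"). A possible variant, sketched here for Ruby 2.2 or later with a made-up method name, decomposes the string first so that the base letter survives.

```ruby
# Hypothetical variant of the four lines above with an NFD step in front.
def slugify(str)
  str.unicode_normalize(:nfd)        # "è" becomes "e" + combining grave
     .gsub(/[^a-zA-Z0-9 ]/, "")      # now only the combining mark is removed
     .gsub(/[ ]+/, " ")              # collapse runs of spaces
     .gsub(/ /, "-")                 # spaces become dashes
     .downcase
end

puts slugify("Crème  Brûlée 2")
```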