
This question has been asked for other programming languages, but how would you perform an accent-insensitive regex match in Ruby?

My current code is something like

scope :by_registered_name, ->(regex){
  where(:name => /#{Regexp.escape(regex)}/i)
}

I thought maybe I could replace every character that is not alphanumeric or whitespace with a dot and remove the escape, but is there not a better way? I'm afraid I could match weird things if I do that...

I am targeting French right now, but if I could also fix it for other languages that would be cool.

I am using Ruby 2.3 if that can help.


I realize my requirements are actually a bit stronger: I also need to handle things like dashes. I am basically importing a school database (URL here, the tag is <nom>), and I want people to be able to find their school by typing its name. Both the search query and the stored names may contain accents, so I believe the easiest way would be to make both accent-insensitive.

  • "Télécom" should be matched by "Telecom"
  • "établissement" should be matched by "etablissement"
  • "Institut supérieur national de l'artisanat - Chambre de métiers et de l'Artisanat en Moselle" should be matched by "artisanat chambre de métiers"
  • "Ecole hôtelière d'Avignon (CCI du Vaucluse)" should be matched by "Ecole hoteliere d'avignon" (it's okay to skip the parentheses)
  • "Ecole française d'hôtesses" should be matched by "ecole francaise d'hot"

Also crazy stuff I found in that DB, I will consider sanitizing this input I think

  • "Académie internationale de management - Hotel & Tourism Management Academy" should be matched by "Hotel Tourism" (note that the & is actually written &amp; in the XML)
the Tin Man
Cyril Duchon-Doris
  • Can you edit your question to include a couple examples of the kinds of input you want to handle and what the corresponding results should be? – Jordan Running May 06 '16 at 19:24
  • In some languages there's a huge difference between 'a' and 'å'. French is largely indifferent. Do you have a collation preference? – tadman May 06 '16 at 19:41
  • See [regexp_extension.rb](https://gist.github.com/ssaunier/5858773). Looks like a port of the code at [*Programatic Accent Reduction in JavaScript (aka text normalization or unaccenting)*](http://stackoverflow.com/questions/227950/programatic-accent-reduction-in-javascript-aka-text-normalization-or-unaccentin/228006#228006). – Wiktor Stribiżew May 06 '16 at 19:53
  • Aouch, I had not really looked at the database before (shame on me... after all the datascience courses I took ;'( ), but it turns out I also have more special characters to handle, see my edit. – Cyril Duchon-Doris May 06 '16 at 20:09
  • Aaaaah, there was an awesome answer by Jordan using ActiveSupport::Inflector, why delete it :'( – Cyril Duchon-Doris May 06 '16 at 20:10
  • @CyrilDuchon-Doris I deleted it because I don't think it answers the question. It explained how to remove diacritics from a Regexp, but the resulting Regexp will not match strings with diacritics, which is what I think OP is trying to do. – Jordan Running May 06 '16 at 20:12
  • @Jordan I see. Well actually you also made me realize that I would need to also sanitize search queries (because someone may also search with accents). I have your answer copy pasted if you want it back. – Cyril Duchon-Doris May 06 '16 at 20:13
  • I can undelete it, which I'll do after reworking it a bit. – Jordan Running May 06 '16 at 20:14
  • What database is the data going into? – Jordan Running May 06 '16 at 20:15
  • A MongoDB database. If you were inquiring about the technology. – Cyril Duchon-Doris May 06 '16 at 20:16
  • Undeleted. Hope it helps! – Jordan Running May 06 '16 at 20:36

1 Answer


It looks like the solution for MongoDB is to use a text index, which is diacritic insensitive. French is supported.

It's been a long time since I last used MongoDB, but if you're using Mongoid I think you would create a text index in your model like this:

index(name: "text")

...and then search like this:

scope :by_registered_name, ->(str) {
  where(:$text => { :$search => str })
}

Consult the documentation for the $text query operator for more information.

Original (wrong) answer

As it turns out I was thinking about the question backwards, and wrote this answer initially. I'm preserving it since it might still come in handy. If you were using a database that didn't offer this kind of functionality (like, it seems, MongoDB does), a possible workaround would be to use the following technique to store a sanitized name along with the original name in the database, and then likewise sanitize queries.
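A minimal sketch of that store-and-sanitize idea in plain Ruby, using the examples from the question. The helper names `searchable` and `name_matches?` are hypothetical, and it relies on `String#unicode_normalize` (available since Ruby 2.2) rather than anything from Rails. It assumes XML entities such as `&amp;` have already been decoded, and it does not handle ligatures like "œ", which have no decomposition:

```ruby
# Hypothetical helper: fold a string into a lowercase, accent-free,
# punctuation-free form that could be stored next to the original name.
def searchable(text)
  text
    .unicode_normalize(:nfd)   # split "é" into "e" + a combining accent
    .gsub(/\p{Mn}/, "")        # drop the combining marks
    .downcase
    .gsub(/[^a-z0-9]+/, " ")   # dashes, quotes, "&", parentheses -> one space
    .strip
end

# Hypothetical helper: sanitize both sides, then do a plain substring test.
def name_matches?(stored_name, query)
  searchable(stored_name).include?(searchable(query))
end

name_matches?("Télécom", "Telecom")
name_matches?("Ecole française d'hôtesses", "ecole francaise d'hot")
```

In a real application the sanitized name would be computed once when the record is saved, so that only the query needs sanitizing at search time.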

Since you're using Rails you can use the handy ActiveSupport::Inflector.transliterate:

regex = /aäoöuü/
transliterated = ActiveSupport::Inflector.transliterate(regex.source, '\?')
# => "aaoouu"
new_regex = Regexp.new(transliterated)
# => /aaoouu/

Or simply:

Regexp.new(ActiveSupport::Inflector.transliterate(regex.source, '\?'))

You'll note that I supplied '\?' as the second argument. This is the replacement string that is substituted for any character that cannot be transliterated to ASCII. The default replacement string is "?", which as you know has special meaning in a regular expression, so it is passed here already escaped.
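To illustrate why the escaping matters, here is a small sketch using only core Ruby. The string "caf?" is a made-up stand-in for a transliterated source in which one character was replaced by the default "?":

```ruby
# Made-up example: pretend the "?" came from the default replacement string.
source = "caf?"

unescaped = Regexp.new(source)                 # /caf?/  : "?" makes the "f" optional
escaped   = Regexp.new(Regexp.escape(source))  # /caf\?/ : "?" is a literal character

unescaped_hits_ca = !(unescaped =~ "ca").nil?  # => true, an unintended match
escaped_hits_ca   = !(escaped =~ "ca").nil?    # => false
escaped_hits_self = !(escaped =~ "caf?").nil?  # => true
```

Passing '\?' as the replacement has the same effect as escaping that one character after the fact.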

Also note that ActiveSupport::Inflector.transliterate does a little bit more than the similar I18n.transliterate. Here's its source:

def transliterate(string, replacement = "?")
  I18n.transliterate(ActiveSupport::Multibyte::Unicode.normalize(
    ActiveSupport::Multibyte::Unicode.tidy_bytes(string), :c),
      :replacement => replacement)
end

The innermost method call, ActiveSupport::Multibyte::Unicode.tidy_bytes, cleans up any invalid UTF-8 characters.

More importantly, ActiveSupport::Multibyte::Unicode.normalize "normalizes" the characters. For example, the sequence "e\u0302" renders as ê and looks like one character, but it's actually two: LATIN SMALL LETTER E and COMBINING CIRCUMFLEX ACCENT. Calling I18n.transliterate on it directly would yield e?, which probably isn't what you want, so normalize is called first to turn it into "\u00EA", which is just one character: LATIN SMALL LETTER E WITH CIRCUMFLEX. That is why the normalize step before transliterate is important. (If you're interested in how that works, read about Unicode equivalence and normalization.)
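The difference between the two forms can be seen with core Ruby alone. This sketch uses `String#unicode_normalize` (Ruby 2.2+) as a stand-in for the ActiveSupport call, so it only shows the composition step, not transliteration:

```ruby
decomposed = "e\u0302"  # LATIN SMALL LETTER E + COMBINING CIRCUMFLEX ACCENT
composed   = "\u00EA"   # LATIN SMALL LETTER E WITH CIRCUMFLEX

decomposed_length = decomposed.length   # => 2
composed_length   = composed.length     # => 1
equal_as_is       = (decomposed == composed)  # => false, the code points differ

# NFC composes the pair into the single precomposed character.
equal_after_nfc = (decomposed.unicode_normalize(:nfc) == composed)  # => true
```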

Jordan Running
  • Thank you for the quality of your answer. I appreciate the links to docs and going back to edit your answer. I guess those 47k rep are well deserved. – Cyril Duchon-Doris May 06 '16 at 21:30
  • Sorry for the wait, this works perfect and the text index from MongoDB does even stronger interpolation (and somewhat tolerant against misspelling), which is perfect for my use case. Answer accepted. – Cyril Duchon-Doris May 09 '16 at 19:50