Custom regular expression i18n

Question

I'm using Rails 3.2.

I'm localizing my site into Romanian. In regular expressions, I would like the range [a-z] to contain, in order, the following letters: a, ă, â, b, c, etc.

Is there a way to tell my application that [a-z] should be the list above, based on my locale?

Also, there is an issue with uppercasing: "â".upcase doesn't return "Â".

Or, maybe these features are not implemented yet in Rails?

Have you looked into [transliteration](http://api.rubyonrails.org/classes/ActiveSupport/Inflector.html#method-i-transliterate)? — Paul Fioravanti, Jun 05 '13 at 10:48
@sawa, the reason I brought it up was the potential for using an ASCII-based regex on a UTF-8 string that has first been transliterated to ASCII, but I've never tried to see if that's a good way to solve the problem. Anyway, [this SO thread](http://stackoverflow.com/q/1910573/567863) may be of some assistance to user1304740 regarding what's possible with i18n upcasing in Ruby. — Paul Fioravanti, Jun 05 '13 at 11:01
@PaulFioravanti - thanks, but it's not applicable to my case (I don't want to get rid of non-Ascii characters). — George, Jun 05 '13 at 12:13

score 1 · Accepted Answer · answered Jun 05 '13 at 10:52

This is not a Rails issue: [a-z] is not required to include non-ASCII characters. In Ruby, [a-z] is a character range that matches the consecutive ASCII letters a through z and nothing else.
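A quick way to see this, together with one possible workaround, is sketched below. The names `ASCII_LOWER`, `ROMANIAN_LOWER` and `romanian_lower?` are made up for illustration, and the sketch assumes a UTF-8 source file with precomposed characters. The workaround simply lists the Romanian letters with diacritics next to the ASCII range; it does not change what [a-z] itself means.

```ruby
# encoding: UTF-8

# [a-z] covers only the code points U+0061..U+007A,
# so "ă" (U+0103) falls outside the range.
ASCII_LOWER = /\A[a-z]+\z/

# Hypothetical workaround: spell out the extra Romanian letters.
# Both the comma-below (ș, ț) and the legacy cedilla (ş, ţ) forms are listed.
ROMANIAN_LOWER = /\A[a-zăâîșțşţ]+\z/

# Returns true when the word consists only of lowercase Romanian letters.
def romanian_lower?(word)
  !!(word =~ ROMANIAN_LOWER)
end

p !!("masa" =~ ASCII_LOWER)  # => true
p !!("masă" =~ ASCII_LOWER)  # => false
p romanian_lower?("masă")    # => true
```

Note that this only affects matching. A character class has no notion of alphabetical order, so it cannot express "a, ă, â, b, c" as a sequence.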

In Ruby, String#upcase is not required to be locale-dependent, and in the Ruby versions Rails 3.2 runs on it only converts ASCII letters. Instead, you can try using the UnicodeUtils gem like so:

% gem install unicode_utils

# encoding: UTF-8
require 'unicode_utils'

# Upcase using the Romanian language rules
p UnicodeUtils.upcase('ă', :ro)
# => "Ă"

Specifying the locale when converting case makes sense because the result can depend on the language. For example:

 UnicodeUtils.upcase('i', :en) # => "I"
 UnicodeUtils.upcase('i', :tr) # => "İ" (dotted capital I), not "I"
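For readers on a newer Ruby than the one Rails 3.2 typically ran on, the same contrast can be sketched without a gem. This assumes Ruby 2.4 or later, where the built-in `String#upcase` performs Unicode case mapping and accepts a `:turkic` option; on 1.9.x it only touches ASCII letters. The helper `upcase_for` and its locale list are hypothetical, added only to show how a locale symbol might be mapped to that option.

```ruby
# encoding: UTF-8
# Assumes Ruby 2.4 or later.

# Hypothetical helper: Turkish and Azerbaijani need the :turkic mapping,
# every other locale uses the default Unicode mapping.
def upcase_for(str, locale = nil)
  [:tr, :az].include?(locale) ? str.upcase(:turkic) : str.upcase
end

p upcase_for("â", :ro)  # => "Â"
p upcase_for("i", :en)  # => "I"
p upcase_for("i", :tr)  # => "İ"
p "â".upcase(:ascii)    # => "â" (ASCII-only mapping, the old behaviour)
```

Romanian needs no special option here because its letters follow the default Unicode case mapping.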

score 0 · Answer 2 · answered Jun 05 '13 at 10:58

I think the [a-z] range is based on character code numbers, so Romanian characters are not included in it. If you want to match any character of the Latin script, use the character property syntax supported by Onigmo:

"ă" =~ /\p{Latin}/
# => 0
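As a slightly longer sketch of the same idea, the property can be anchored to check a whole word. The names `LATIN_WORD` and `latin_word?` are invented for this example, and a UTF-8 source file is assumed. Keep in mind that `\p{Latin}` covers the entire Latin script, so it also accepts uppercase letters and letters that are not part of the Romanian alphabet.

```ruby
# encoding: UTF-8

# One or more characters of the Latin script, nothing else.
LATIN_WORD = /\A\p{Latin}+\z/

# Returns true when every character in the word belongs to the Latin script.
def latin_word?(word)
  !!(word =~ LATIN_WORD)
end

p latin_word?("masă")   # => true
p latin_word?("Masă")   # => true  (case is not restricted)
p latin_word?("café")   # => true  (not Romanian-specific)
p latin_word?("да")     # => false (Cyrillic script)
p latin_word?("masă1")  # => false (digits are not letters)
```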

Answered by sawa.

Thanks, I will investigate Onigmo. – George Jun 05 '13 at 12:25
