I am trying to read a text file that contains many strings with accents (diacritics), and fill a database with those strings stripped of their accents, using Ruby (not Ruby on Rails).

For example I have:

J'ai été mise au courant des éventualités à temps.

I want to replace the whole line to have the following string:

J'ai ete mise au courant des eventualites a temps.

To do that, I found this method, which should work:

    def convert_to_ascii(s)
        undefined = ''
        fallback = { 'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A',
                   'Å'=>'A', 'Æ'=>'AE', 'Ç'=>'C', 'È'=>'E', 'É'=>'E',
                   'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I',
                   'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O',
                   'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U', 'Ú'=>'U',
                   'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'à'=>'a', 'á'=>'a',
                   'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a', 'æ'=>'ae',
                   'ç'=>'c', 'è'=>'e', 'é'=>'e', 'ê'=>'e', 'ë'=>'e',
                   'ì'=>'i', 'í'=>'i', 'î'=>'i', 'ï'=>'i', 'ñ'=>'n',
                   'ò'=>'o', 'ó'=>'o', 'ô'=>'o', 'õ'=>'o', 'ö'=>'o',
                   'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'ü'=>'u',
                   'ý'=>'y', 'ÿ'=>'y' }

        s.encode('ASCII', fallback: lambda { |c| fallback.key?(c) ? fallback[c] : undefined })
    end

But it just gives me the following string:

J'ai t mise au courant des ventualits temps.

Or even:

J'ai �t� mise au courant des �ventualit�s temps.

I don't understand why it does not work.

EDIT:

I was using

    file = File.open(i_FileName, 'r:utf-8')

to read the file. I replaced it with

    file = File.open(i_FileName, 'r:iso-8859-1:utf-8')

and it works like a charm!
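
For reference, here is a minimal sketch of that fix, assuming the input file really is ISO-8859-1 (Latin-1). The helper name `read_latin1_lines` is hypothetical. The mode string `'r:iso-8859-1:utf-8'` declares ISO-8859-1 as the external (on-disk) encoding and UTF-8 as the internal one, so Ruby transcodes each line while reading it.

```ruby
# Hypothetical helper: reads a file assumed to be ISO-8859-1 encoded
# and returns its lines as UTF-8 strings without trailing newlines.
def read_latin1_lines(path)
  # External encoding: ISO-8859-1 (bytes on disk).
  # Internal encoding: UTF-8 (what the strings are transcoded to).
  File.open(path, 'r:iso-8859-1:utf-8') do |file|
    file.each_line.map(&:chomp)
  end
end
```

With `'r:utf-8'` the Latin-1 bytes are only labelled as UTF-8, not converted, which is presumably why the accented letters came out as invalid characters in the first place.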

Siya

1 Answer

TL;DR: Use String#unicode_normalize.

The unexpected result arises because a character such as é may be represented either as 1 symbol (the so-called Unicode composed form) or as 2 different symbols (the Unicode decomposed form).

    "J'ai été mise au courant des éventualités à temps.".
      unicode_normalize(:nfd).
      gsub(/./) { |m| m.ord > 255 ? '' : m }
    #⇒ "J'ai ete mise au courant des eventualites a temps."

Or, even simpler:

    "J'ai été mise au courant des éventualités à temps.".
      unicode_normalize(:nfd).gsub(/[\u0300-\u036F]/, '')
    #⇒ "J'ai ete mise au courant des eventualites a temps."

What we are doing here: we normalize the string to decomposed form, so every precomposed accented letter becomes a base letter followed by a separate combining mark. Then we strip those marks with String#gsub.
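
To make the two forms concrete, here is a small sketch that inspects the code points before and after normalization. The helper name `strip_accents` is hypothetical; it only wraps the two steps described above.

```ruby
# Hypothetical helper combining the two steps described above.
def strip_accents(s)
  s.unicode_normalize(:nfd)        # split letters from their combining marks
   .gsub(/[\u0300-\u036F]/, '')    # drop the combining diacritical marks
end

composed   = "\u00E9"                            # "é" as a single code point
decomposed = composed.unicode_normalize(:nfd)    # "e" + combining acute accent

p composed.codepoints    # [233]
p decomposed.codepoints  # [101, 769]
p strip_accents("\u00E9t\u00E9")
```

Note that this only affects letters that have a decomposition. Letters such as æ or ø do not decompose into a base letter plus a combining mark, so they pass through unchanged and would still need a lookup table.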


If you would rather not throw your existing code away, normalize the string to composed form and then call your encode; it should work then.

    composed = "J'ai été mise au courant des éventualités à temps.".
       unicode_normalize(:nfc) # NOTE :nfc parameter

    composed.encode(.....)
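
A minimal sketch of this variant, under stated assumptions: `FALLBACK` is a hypothetical, trimmed-down stand-in for the table in the question, and `to_ascii` is a hypothetical helper name. Composing first means every accented letter is a single code point, so it can match a key of the table.

```ruby
# Hypothetical, trimmed-down stand-in for the question's fallback table.
# Keys are precomposed (NFC) letters, written as escapes.
FALLBACK = {
  "\u00E9" => 'e', # é
  "\u00E8" => 'e', # è
  "\u00E0" => 'a', # à
  "\u00E7" => 'c'  # ç
}.freeze

# Hypothetical helper: compose first, then transliterate while encoding.
def to_ascii(s)
  s.unicode_normalize(:nfc)
   .encode('ASCII', fallback: ->(c) { FALLBACK.fetch(c, '') })
end

# The input below is deliberately in decomposed form (e + U+0301).
puts to_ascii("J'ai e\u0301te\u0301 mise au courant")
```

Characters missing from the table are replaced with an empty string here, mirroring the `undefined = ''` choice in the question.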
Aleksei Matiushkin
  • Thank you! It works when I use a string that I define just before, but it doesn't work when I read the string from my text file. It may be a problem with my Ruby script. Anyway, thank you. – Siya Apr 26 '18 at 10:51
  • It surely works with anything _that has UTF-8 encoding_. If your file is, e.g., in Latin-1, you might want to read it in its original encoding, then [`String#encode`](https://ruby-doc.org/core/String.html#method-i-encode) it into UTF-8, and then the above should work. – Aleksei Matiushkin Apr 26 '18 at 11:07
  • I found the error: I was using file = File.open(i_FileName, 'r:utf-8') to read the file. I replaced it with file = File.open(i_FileName, 'r:iso-8859-1:utf-8') and it works like a charm! Thanks – Siya Apr 26 '18 at 12:50