I would like to convert characters with accents or similar diacritics to the corresponding ordinary characters:

  • á, à, â should become a
  • é, ê should become e
  • Ç should become C
  • etc.

It could be done by chaining a million .replace(...) calls, but I'm looking for a more elegant solution. The difficulty is finding out which ordinary character belongs to which extended character. I can easily see that á is an extension of a, but how do I automate this step?

Why I want to do this:

I have an interface between two applications. Application One provides data that contains said accents. Application Two can only work with data that matches [a-zA-Z].

k0pernikus
Florian-Rh
  • Those are accents, not apostrophes. – SLaks Dec 11 '17 at 13:58
  • Related: https://stackoverflow.com/questions/286921/efficiently-replace-all-accented-characters-in-a-string – k0pernikus Dec 11 '17 at 13:59
  • There seems to be a library for that purpose: [latinize](https://www.npmjs.com/package/latinize) – k0pernikus Dec 11 '17 at 14:01
  • Possible duplicate of [Efficiently replace all accented characters in a string?](https://stackoverflow.com/questions/286921/efficiently-replace-all-accented-characters-in-a-string) – k0pernikus Dec 11 '17 at 14:03
  • Added it as an answer, added an explanation on how the library works internally. – k0pernikus Dec 13 '17 at 11:56
  • And regarding your dislike of `replace` calls: internally, all characters are just numbers to which we humans have assigned meaning. So even though you know that `â` should relate to `a` by its design, the computer only sees the UTF-8 bytes `c3a2`, and you want it to be `61`. There is no connection between the two numbers, and the numbers could just as well be different. There is no way to compute meaning. You'll have to map that data yourself. – k0pernikus Dec 13 '17 at 12:06
  • @SLaks The correct term is [diacritics](https://en.wikipedia.org/wiki/Diacritic), actually. – Nyerguds Feb 03 '18 at 14:22

1 Answer

You can use the library latinize, installable through:

npm install latinize

Since you are using TypeScript, you can also install its typings:

npm install @types/latinize

Usage:

var latinize = require('latinize');
latinize('ỆᶍǍᶆṔƚÉ áéíóúýčďěňřšťžů'); // => 'ExAmPlE aeiouycdenrstzu'

Internally, it uses a regex to match every character that is not a basic Latin letter or an Arabic numeral, and passes each match to a callback function that returns the replacement:

function latinize(str) {
    if (typeof str === 'string') {
        return str.replace(/[^A-Za-z0-9]/g, function(x) {
            return latinize.characters[x] || x;
        });
    } else {
        return str;
    }
}

The callback finds the target character with the help of a predefined character lookup table, latinize.characters.
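
A stripped-down sketch of that table-driven replacement, in TypeScript. The `charMap` object and the `replaceWithTable` name are made up for illustration and cover only the characters from the question; this is not latinize's own table. Unicode escapes are used so that each key is unambiguously a single precomposed character.

```typescript
// Hypothetical, deliberately tiny lookup table (assumption: only the
// characters mentioned in the question are needed).
const charMap: Record<string, string> = {
  "\u00e1": "a", // á
  "\u00e0": "a", // à
  "\u00e2": "a", // â
  "\u00e9": "e", // é
  "\u00ea": "e", // ê
  "\u00c7": "C", // Ç
};

// Replace every character outside [A-Za-z0-9] that has an entry in the
// table; characters without an entry are returned unchanged.
function replaceWithTable(input: string): string {
  return input.replace(/[^A-Za-z0-9]/g, (ch) => charMap[ch] ?? ch);
}
```

Characters without a table entry pass through unchanged, so the result is only guaranteed to match [a-zA-Z] if the table covers every character that can occur in the input.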


In the end, this solution is also a search-and-replace approach. I know you want to automate the discovery of the base characters, but character encoding doesn't work that way.

The computer, and hence JavaScript, is unaware of the design and the meaning of a character. A character is just a number (a code point) that we use to identify a symbol. The assignment of those numbers is largely arbitrary, with little internal consistency.

So even though you know that â should relate to a by its design, the computer only knows the Unicode code point U+00E2 (the bytes C3 A2 in UTF-8). You want it to be U+0061, though.

Yet the number alone reveals no connection. You would have to compare the symbols themselves, and that is hardly feasible, especially for symbols that look nearly identical, e.g. Greek Α (U+0391) and Latin A (U+0041).
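
To make the numbers concrete, here is a small TypeScript sketch that prints the code point of a character. The helper name `toCodePoint` is hypothetical and only serves this illustration.

```typescript
// Format the first code point of a string as "U+XXXX".
function toCodePoint(ch: string): string {
  const cp = ch.codePointAt(0);
  if (cp === undefined) {
    throw new Error("expected a non-empty string");
  }
  const hex = cp.toString(16).toUpperCase();
  // Left-pad to at least four hex digits, as is conventional for code points.
  return "U+" + (hex.length < 4 ? "0000".slice(hex.length) + hex : hex);
}

console.log(toCodePoint("\u00e2")); // â -> U+00E2
console.log(toCodePoint("a"));      // a -> U+0061
console.log(toCodePoint("\u0391")); // Greek Α -> U+0391
console.log(toCodePoint("A"));      // Latin A -> U+0041
```

The two visually identical capitals end up 848 code points apart, while â and a are 129 apart; neither distance says anything about how the symbols look.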

There is no way to compute meaning. You'll have to map an extended character to its Latin counterpart yourself (or via the help of a library).

k0pernikus
  • Note that *â* is U+00E2, not U+C382. But getting from *â* to *a* is simple via normalization. Getting from *ﬀ* (U+FB00) to *ff* (U+0066 U+0066) is a bit more complicated, and getting from *Α* (U+0391) to *A* (U+0041) requires custom mappings based on appearance. – Joey Dec 15 '17 at 10:30
  • I agree that there is no formula that works for all cases. However, you can use the official Unicode description of each character to automatically construct a map that works for accents, diacritics, "ff", "Α", and so on. This is what [ubase.js](https://www.npmjs.com/package/ubase.js) does. – sanette Feb 22 '23 at 20:00
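
The normalization route mentioned in the comments could look roughly like the following TypeScript sketch. It relies on the standard `String.prototype.normalize` method; the function names `stripDiacritics` and `foldCompatibility` are made up here. It is not what latinize does, and it only helps for characters that Unicode defines a decomposition for.

```typescript
// Decompose each character into base letter + combining marks (NFD),
// then drop the marks from the Combining Diacritical Marks block.
function stripDiacritics(input: string): string {
  return input.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

// Same idea with compatibility decomposition (NFKD), which also expands
// ligatures such as U+FB00 into their component letters.
function foldCompatibility(input: string): string {
  return input.normalize("NFKD").replace(/[\u0300-\u036f]/g, "");
}

console.log(stripDiacritics("\u00e1\u00e0\u00e2\u00e9\u00ea\u00c7")); // "aaaeeC"
console.log(foldCompatibility("\ufb00"));                             // "ff"
console.log(stripDiacritics("\u0391") === "\u0391");                  // true: Greek Α is untouched
```

Lookalikes such as Greek Α have no decomposition to Latin A, so a lookup table is still needed for those, and the result should still be checked against [a-zA-Z] before handing it to the second application.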