
As the title states, I am attempting to display Unicode accent marks next to letters.

This task comes from needing to go through a string, identify each special character, and then "simplify" it by separating the accent mark from the letter and displaying them side by side (the word being spelled correctly doesn't matter, only the formatting matters).

e.g. Às --> Aˋs

I already have the Unicode values needed, so I do not need to identify any of the characters.

I'm attempting to do this dynamically, so I've stored each special character's Unicode value and its replacement Unicode values in objects within an array. Rather than iterate through every single character in the string, I'm globally replacing every instance of the special character with the new combination of Unicode characters I want. Please see my current code below:

//String to check for special characters
var string_data = "Às simple as this sounds...it is trivial";

//Array of special(incompatible) characters and replacement unicode characters
var unicodeChars = [
{
    incompatible_unicode_char: "\u00C0",//À
    replace_uni_char_one: "\u0041", //A
    replace_uni_char_two: "\u0300" //ˋ
}
];

//Convert property values from unicodeChars objects to readable characters
for(var i = 0; i< unicodeChars.length;i++){ 
    String.fromCharCode(parseInt(unicodeChars[i].incompatible_unicode_char,16));
    String.fromCharCode(parseInt(unicodeChars[i].replace_uni_char_one,16));
    String.fromCharCode(parseInt(unicodeChars[i].replace_uni_char_two,16));
}

//Iterate through each object in unicodeChars array 
for(var i = 0; i<unicodeChars.length;i++){

  //Creating a string that holds the value of what to replace the special character with
  var replacement_chars = unicodeChars[i].replace_uni_char_one;
  if(unicodeChars[i].replace_uni_char_two != null){
    replacement_chars = replacement_chars + unicodeChars[i].replace_uni_char_two;
  }

  //creating regex object in order to globally replace any occurrence of the special character in the string
  var regex = new RegExp(unicodeChars[i].incompatible_unicode_char, "g");

  //attempting to replace the occurrence 
  string_data = string_data.replace(regex, replacement_chars);
}

My desired end value of string_data is: Aˋs simple as this sounds...it is trivial

However the problem here is that the current end value is: Às simple as this sounds...it is trivial

So string_data appears not to change at all, yet at the same time it does. While investigating, I found that concatenating a letter and an accent mark combines them into a single letter.

So in my code when I do the following: replacement_chars = replacement_chars + unicodeChars[i].replace_uni_char_two; the code automatically combines the accent mark from the unicodeChars[i].replace_uni_char_two with the standard letter held in replacement_chars.

I do not want this combining to take place; I wish to display them next to each other, like Aˋs rather than Às. How do I stop JavaScript from automatically combining the accent mark and standard letter?

Please keep in mind that I need to keep the current structure of this code in place (the unicodeChars array, converting Unicode values to characters, and then using a regex to perform a global replace), and I wish to keep this solution as dynamic as it currently is.

rze
  • Look up normalization, in particular Normalization Form D (NFD), for splitting composed characters up into base characters and combining characters, instead of trying to do it manually on a case-by-case basis. Then if you don't want the combining characters to actually combine... um... – Shawn Nov 09 '18 at 09:38
  • ... *maybe* putting a `U+200C` [ZERO WIDTH NON-JOINER](https://en.wikipedia.org/wiki/Zero-width_non-joiner) between each code point might do the trick. – Shawn Nov 09 '18 at 09:43
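
A minimal sketch of what these two comments describe, using the standard `String.prototype.normalize`. The sample string and the names `hex`, `decomposed`, and `withZwnj` are illustrative only. Whether a ZERO WIDTH NON-JOINER actually stops the accent from being drawn over the letter depends on the font and renderer, so treat that part as an untested assumption.

```javascript
// NFD splits a precomposed character into base letter + combining mark.
const decomposed = "\u00C0s".normalize("NFD");

// Helper (hypothetical name): list the code points of a string as hex.
const hex = (s) =>
  Array.from(s, (ch) =>
    ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );

console.log(hex(decomposed)); // [ '0041', '0300', '0073' ]

// Assumption: a ZWNJ between code points may keep the mark from
// combining visually; this only builds the string, it proves nothing
// about how it is rendered.
const withZwnj = Array.from(decomposed).join("\u200C");
console.log(hex(withZwnj));
```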

2 Answers


The problem arises because you are using a combining character instead of a modifier letter for the grave accent in your code example, so just change the value of replace_uni_char_two from \u0300 to \u02CB. To confirm that change fixes the issue, run this trivial JavaScript:

console.log('u00C0         : \u00C0');
console.log('u0041 + u0300 : \u0041\u0300  [Uses combining character for grave accent]');
console.log('u0041 + u02cb : \u0041\u02cb [Uses modifier letter for grave accent]');

Here's the output:

u00C0         : À
u0041 + u0300 : À  [Uses combining character for grave accent]
u0041 + u02cb : Aˋ [Uses modifier letter for grave accent]

Note that:

  • The decomposition of U+00C0 (À) is LATIN CAPITAL LETTER A (U+0041) plus COMBINING GRAVE ACCENT (U+0300).
  • COMBINING GRAVE ACCENT (U+0300) is a combining character which will be combined with the preceding character into a single glyph for rendering. This is the problem you need to fix in your code.
  • In contrast, the character which fixes your problem, MODIFIER LETTER GRAVE ACCENT (U+02CB), is visually very similar to COMBINING GRAVE ACCENT (U+0300), but it is a modifier letter. It will not be combined with the preceding character into a single glyph for rendering.
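
Applied to the question's own structure, the change might look like the sketch below. It keeps the asker's array and global regex replace and only swaps U+0300 for U+02CB. The no-op conversion loop from the question is left out here for brevity, which is an editorial simplification rather than part of the fix.

```javascript
// String to check for special characters
var string_data = "Às simple as this sounds...it is trivial";

// Same structure as in the question; only the accent code point differs.
var unicodeChars = [
  {
    incompatible_unicode_char: "\u00C0", // À
    replace_uni_char_one: "\u0041",      // A
    replace_uni_char_two: "\u02CB"       // MODIFIER LETTER GRAVE ACCENT (was \u0300)
  }
];

for (var i = 0; i < unicodeChars.length; i++) {
  // Build the replacement: base letter plus optional modifier letter
  var replacement_chars = unicodeChars[i].replace_uni_char_one;
  if (unicodeChars[i].replace_uni_char_two != null) {
    replacement_chars += unicodeChars[i].replace_uni_char_two;
  }

  // Globally replace every occurrence of the special character
  var regex = new RegExp(unicodeChars[i].incompatible_unicode_char, "g");
  string_data = string_data.replace(regex, replacement_chars);
}

console.log(string_data);
```

With a modifier letter in the table, the resulting string starts with the two code points U+0041 and U+02CB, which a renderer draws as separate glyphs.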

Therefore the general approach to fix your code is:

  • Determine the decomposition of each special character you have defined, which will probably be a base character followed by a single combining character.
  • Get the modifier letter counterpart of the combining character. The Unicode name of a combining character will include "COMBINING", and the name of its modifier letter counterpart will include "MODIFIER LETTER". For example: "COMBINING GRAVE ACCENT" vs "MODIFIER LETTER GRAVE ACCENT".
  • In your code declaration of unicodeChars specify the values of modifier letters rather than combining characters.
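
The steps above can be sketched as follows. `COMBINING_TO_MODIFIER` and `separateAccents` are hypothetical names, the table covers only three accents for illustration, and using `normalize("NFD")` to obtain the decomposition is an assumption on top of this answer rather than something the question's code does.

```javascript
// Hypothetical lookup table: combining mark -> modifier letter counterpart.
const COMBINING_TO_MODIFIER = {
  "\u0300": "\u02CB", // COMBINING GRAVE ACCENT      -> MODIFIER LETTER GRAVE ACCENT
  "\u0301": "\u02CA", // COMBINING ACUTE ACCENT      -> MODIFIER LETTER ACUTE ACCENT
  "\u0302": "\u02C6"  // COMBINING CIRCUMFLEX ACCENT -> MODIFIER LETTER CIRCUMFLEX ACCENT
};

// Decompose the text, then swap each known combining mark for its
// modifier letter. Marks missing from the table are left unchanged.
function separateAccents(text) {
  return text
    .normalize("NFD")
    .replace(/[\u0300-\u036F]/g, (mark) => COMBINING_TO_MODIFIER[mark] || mark);
}

console.log(separateAccents("Às simple as this sounds...it is trivial"));
```

A table like this would need one entry per accent that appears in the input; any accent without a modifier letter counterpart would still combine when rendered.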

For more details on this non-trivial issue, see What is the difference between “combining characters” and “modifier letters”?

skomisa

How do I stop JavaScript from automatically combining the accent mark and standard letter?

You're blaming the wrong system: it's the font renderer that combines the glyphs, not JavaScript.
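
One way to check this is to inspect the string itself rather than how it is displayed. The sketch below uses illustrative variable names and shows that the replacement from the question does change the string, even though both versions look the same on screen.

```javascript
const original = "\u00C0s";                                  // À as one code point
const replaced = original.replace(/\u00C0/g, "\u0041\u0300"); // A + combining grave

console.log(original === replaced);            // false
console.log(original.length, replaced.length); // 2 3

// List the code points that are actually stored in the replaced string.
const codePoints = Array.from(
  replaced,
  (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
);
console.log(codePoints); // [ 'U+0041', 'U+0300', 'U+0073' ]
```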


In JavaScript, simply surround the marks with spaces so that they stand alone.

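// Decompose to NFD, then pad every combining mark with spaces.
XRegExp.replace(
    "Às simple as this sounds...it is trivial".normalize('NFD'),
    XRegExp('(\\p{Mark})', 'g'), // 'g' so that every mark is replaced, not only the first
    ' $1 '
)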
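
A rough equivalent without the XRegExp library, assuming a runtime that supports Unicode property escapes in native regular expressions (ES2018 or later). The variable name `spaced` is illustrative.

```javascript
// Decompose to NFD, then surround every combining mark (\p{M}) with spaces.
const spaced = "Às simple as this sounds...it is trivial"
  .normalize("NFD")
  .replace(/(\p{M})/gu, " $1 ");

console.log(spaced);
```

The mark is still a combining character here, so a renderer will typically attach it to the space in front of it instead of to the letter.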
daxim