
I understand (after some pain...) that the translate function will not handle multibyte Unicode. I am looking for a way around this in order to remove all accents from characters. As a sample, I have the following transform and its output:

<?xml version="1.0"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xsl:output method="text" encoding="UTF-8"/>
  <xsl:variable name="RSEP" select="'&#10;'"/>  <!-- LF -->
  <xsl:template match="/">
    <xsl:variable name="testwords" select="'à wɔ́rɔ, yɛrɛ, wùri'"/>
    <xsl:value-of select="$testwords"/>
    <xsl:value-of select="$RSEP"/>
    <xsl:value-of select="translate($testwords,
      'àáèéɛ̀ɛ́ɔɔ̀ɔ́ìíòóuùú',
      'aaeeɛɛɔɔɔiioouuu')"/>
    <xsl:value-of select="$RSEP"/>
    <xsl:value-of select="normalize-unicode($testwords)"/>
    <xsl:value-of select="$RSEP"/>
    <xsl:value-of select="replace(normalize-unicode($testwords, 'NFKD'), '\P{IsBasicLatin}', '')"/>
    <xsl:value-of select="$RSEP"/>
  </xsl:template>
</xsl:stylesheet>

Output with xslt3:

à wɔ́rɔ, yɛrɛ, wùri
a wɔɔrɔ, yɛrɛ, wri
à wɔ́rɔ, yɛrɛ, wùri
a wr, yr, wuri

I realize the translate function is not expected to work. But normalize-unicode does not seem to change the string at all, and a 'replace' call I found elsewhere only seems to handle the standard Western European accented characters; the multibyte ones are simply stripped out.

I have a feeling this may require some kind of regex, but I am just not sure how to go about that. Any help here appreciated.

Thanks!

Boyd
  • Ah if you mean the terminal it is also: LANG=en_CA.utf8 – Boyd May 15 '23 at 18:00
  • See if these help: https://stackoverflow.com/questions/5398127/how-do-i-strip-accents-from-characters-in-xsl, https://stackoverflow.com/questions/56989053/how-to-fix-a-special-character-in-xslt. – michael.hor257k May 15 '23 at 18:02
  • Yes, I saw that. But notice the comment near the end: "Although be warned that any characters which can't be decomposed and aren't basic ASCII (Norwegian ø or Icelandic Þ for example) will be completely deleted from the string, but that's probably okay with your requirements." This is exactly the point: that solution *strips out* the multibyte characters, exactly as my sample demonstrates, but that is not OK for my requirements. I need a solution that replaces those characters with the 'no accent' version. Too bad the translate function does not work here. – Boyd May 15 '23 at 18:22
  • OK, the second post seems to work! replace(normalize-unicode($string, 'NFD'), '\p{Mn}', '').... I will have to do some further testing.... – Boyd May 15 '23 at 18:35
  • On reflection this is probably a duplicate of https://stackoverflow.com/questions/5398127/how-do-i-strip-accents-from-characters-in-xsl?noredirect=1&lq=1 – Michael Kay May 15 '23 at 23:06

1 Answer


You're really confusing matters by talking about "multi-byte" Unicode characters. The number of bytes occupied by a character is determined by the encoding (for example, in UTF-8, codepoints in the range 0-127 occupy one byte), but XSLT operations don't depend in any way on the encoding: XSLT is only interested in Unicode as a sequence of codepoints.
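To make the codepoint view concrete, here is a minimal sketch (not part of the original question) that inspects one of the problem characters. The variable name `s` is just an illustration, and the character is written with numeric references so that its form does not depend on how an editor saved the file. Run against any input document, it should report two codepoints, 596 (U+0254, ɔ) and 769 (U+0301, combining acute accent).

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" encoding="UTF-8"/>
  <xsl:template match="/">
    <!-- U+0254 (open o) followed by U+0301 (combining acute accent) -->
    <xsl:variable name="s" select="'&#x254;&#x301;'"/>
    <!-- string-length() counts codepoints, not bytes: 2 -->
    <xsl:value-of select="string-length($s)"/>
    <xsl:text>&#10;</xsl:text>
    <!-- The decimal codepoints: 596 769 -->
    <xsl:value-of select="string-to-codepoints($s)" separator=" "/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

Changing the `encoding` attribute on `xsl:output` alters the bytes written to the output file, but not these two numbers.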

What you are actually talking about here are what Unicode calls combining and modifier characters. There's a great description of these here:

What is the difference between "combining characters" and "modifier letters"?

An ordinary character followed by one or more combining or modifier characters can be considered as some kind of composite character, and it is this composite character that you are referring to as a "multi-byte character".

Now we get to Unicode normalization, because some of these "composite characters" have two possible representations: a "composed form" using a single codepoint, and a "decomposed form" comprising a base character and one or more modifiers. When you use the translate() function in XSLT, the result will depend on which form the data takes, and you can force it into either form by using the normalize-unicode() function.
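As an illustration of how the form changes what translate() sees, the sketch below applies the same call to a composed and a decomposed "à". The variable names and the two-character map are assumptions chosen for the demo; translate() deletes any character in its second argument that has no counterpart in the third.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" encoding="UTF-8"/>
  <xsl:template match="/">
    <!-- U+00E0: precomposed "à", a single codepoint -->
    <xsl:variable name="composed" select="'&#xE0;'"/>
    <!-- NFD turns it into "a" followed by U+0300 (combining grave accent) -->
    <xsl:variable name="decomposed" select="normalize-unicode($composed, 'NFD')"/>
    <!-- The map lists only U+0300 and U+0301, with an empty replacement string -->
    <!-- Composed form: neither mark occurs as its own codepoint, so "à" is unchanged -->
    <xsl:value-of select="translate($composed, '&#x300;&#x301;', '')"/>
    <xsl:text>&#10;</xsl:text>
    <!-- Decomposed form: U+0300 is found and deleted, leaving "a" -->
    <xsl:value-of select="translate($decomposed, '&#x300;&#x301;', '')"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

This also suggests why a hand-typed translate() map is fragile: whether a pasted "à" in the map is one codepoint or two is invisible in most editors.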

If you are trying to remove modifiers (such as diacritical marks) from the input then you can force the string into decomposed form, and then call replace() to remove codepoints in the relevant character category (or categories).
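A minimal sketch of that approach, using the expression the asker reports working in the comments. The function name `local:strip-marks` and the `urn:example:local` namespace are hypothetical, introduced only for this example. `\p{Mn}` matches the nonspacing-mark category; other categories could be added if the data needs them.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:local="urn:example:local"
    exclude-result-prefixes="xs local">
  <xsl:output method="text" encoding="UTF-8"/>

  <!-- Hypothetical helper: decompose to NFD, then delete nonspacing marks (Mn) -->
  <xsl:function name="local:strip-marks" as="xs:string">
    <xsl:param name="text" as="xs:string?"/>
    <xsl:sequence select="replace(normalize-unicode($text, 'NFD'), '\p{Mn}', '')"/>
  </xsl:function>

  <xsl:template match="/">
    <!-- Test string from the question -->
    <xsl:value-of select="local:strip-marks('à wɔ́rɔ, yɛrɛ, wùri')"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

For the test string this should give `a wɔrɔ, yɛrɛ, wuri`. Base letters such as ɔ and ɛ are kept because only the marks are removed, and letters with no decomposition (ø, for example) pass through unchanged rather than being deleted.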

Michael Kay
  • OK, thanks for the correction. I did not have the correct terminology! I did finally get a working transform with: replace(normalize-unicode($string, 'NFD'), '\p{Mn}', ''). – Boyd May 16 '23 at 11:24
  • @Boyd: I suggest that you accept this thorough and helpful answer. – kjhughes May 16 '23 at 12:38
  • Yes absolutely! Was just getting back to this.... to confirm that this also helped me move from To a more reliable: And this is working perfectly! – Boyd May 16 '23 at 14:40