Automatically add an attribute and values based on Latinized characters between element

Question

I'm using Oxygen XML Editor 23.1. I'm working on a large corpus of text and would like to use a transformation to automatically add certain attributes and values to certain elements. In this case, I have a @correspUnic attribute, created to hold Ugaritic glyphs as Unicode decimal references. The values of @correspUnic depend on the Latinized characters inside the elements. Here's an example of the TEI encoding:

<w>bn</w>
<g>.</g>
<name>qdš</name>
<w>
  <seg>ʾa</seg>
  <unclear>b̊</unclear>
</w>

Expected result:

<w correspUnic='&#66433;&#66448;'>bn</w>
<g correspUnic='&#66463;'>.</g>
<name correspUnic='&#66454;&#66436;&#66444;'>qdš</name>
<w>
  <seg correspUnic='&#x10380;'>ʾa</seg>
  <unclear correspUnic='&#66433;'>b̊</unclear>
</w>

I have tried several variants of an XSL transformation file, but I confess that after several hours, I am close to giving up. Here is the latest code, which sadly doesn't work:

<!-- Define the str-split function -->
   <xsl:template name="str-split">
      <xsl:param name="input" />
      <xsl:param name="delimiter" select="''" />
      <xsl:choose>
         <xsl:when test="contains($input, $delimiter)">
            <xsl:variable name="first" select="substring-before($input, $delimiter)" />
            <xsl:variable name="rest" select="substring-after($input, $delimiter)" />
            <char>
               <xsl:value-of select="$first" />
            </char>
            <xsl:call-template name="str-split">
               <xsl:with-param name="input" select="$rest" />
               <xsl:with-param name="delimiter" select="$delimiter" />
            </xsl:call-template>
         </xsl:when>
         <xsl:otherwise>
            <char>
               <xsl:value-of select="$input" />
            </char>
         </xsl:otherwise>
      </xsl:choose>
   </xsl:template>
   
   <!-- Define Unicode data directly in the variable -->
   <xsl:variable name="unicodeData">
      <data>
         <row>
            <latin>ʾa</latin>
            <Unicode>66432</Unicode>
         </row>
         <row>
            <latin>b</latin>
            <Unicode>66433</Unicode>
         </row>
         <row>
            <latin>g</latin>
            <Unicode>66434</Unicode>
         </row>
         <row>
            <latin>ḫ</latin>
            <Unicode>66435</Unicode>
         </row>
         <row>
            <latin>d</latin>
            <Unicode>66436</Unicode>
         </row>
       <!-- etc -->
      </data>
   </xsl:variable>
   
   <xsl:template match="/">
      <!-- Display the value of the variable $unicodeData -->
      <xsl:message select="$unicodeData" />
      
      <xsl:apply-templates/>
   </xsl:template>

   
   <!-- XSLT template for adding @correspUnic to w, g, unclear, name, seg, and supplied -->
   <xsl:template match="w | g | unclear | name | seg | supplied">
      <!-- Copy current element -->
      <xsl:copy>
         <!-- Apply rules to add @correspUnic to children -->
         <xsl:apply-templates select="node()" />
         <!-- Check whether the current element must have @correspUnic -->
         <xsl:if test="self::name or self::seg or self::supplied or self::w or self::g or self::unclear">
            <!-- Recover Latinized characters from textual descendants -->
            <xsl:variable name="latinized">
               <xsl:for-each select="descendant::text()">
                  <xsl:value-of select="." />
               </xsl:for-each>
            </xsl:variable>
            <!-- Check if Latinized characters are detected -->
            <xsl:if test="normalize-space($latinized)">
               <!-- Use the str-split function to split the string -->
               <xsl:variable name="correspUnicode">
                  <xsl:call-template name="str-split">
                     <xsl:with-param name="input" select="$latinized" />
                  </xsl:call-template>
               </xsl:variable>
               <!-- Add @correspUnic attribute with Unicode values -->
               <xsl:attribute name="correspUnic">
                  <xsl:for-each select="$correspUnicode/char">
                     <xsl:variable name="char" select="." />
                     <xsl:if test="normalize-space($char)">
                        <xsl:value-of select="concat('&amp;#', $unicodeData//row[latin = $char]/Unicode, ';')" />
                     </xsl:if>
                  </xsl:for-each>
               </xsl:attribute>
            </xsl:if>
         </xsl:if>
      </xsl:copy>
   </xsl:template>

As you can see, I added xsl:message to see any errors that would have a direct impact on adding the attribute and its values, but nothing...

Thank you very much in advance for your advice and suggestions.

I am struggling to understand what your question is. Are you trying to split a given string (e.g. `bn`) into individual characters and look up the value of each character from a table hard-coded into your stylesheet? That shouldn't be very difficult - especially if you can use XSLT 2.0 or higher. If not, please identify your processor (see: https://stackoverflow.com/a/25245033/3016153). — michael.hor257k, Aug 24 '23 at 18:51
Do you know https://www.w3.org/TR/xpath-functions-31/#func-string-to-codepoints and https://www.w3.org/TR/xpath-functions-31/#func-codepoints-to-string? — Martin Honnen, Aug 25 '23 at 07:14
Inside of oXygen with access to Saxon EE 11 or 12 you could also consider the new XPath 4.0 function `fn:characters($string as xs:string?) as xs:string*` https://www.saxonica.com/html/documentation11/v4extensions/new-functions.html as it does "Splits a string into a sequence of single-character strings. For example, `fn:characters("red")` returns `("r", "e", "d")`" — Martin Honnen, Aug 25 '23 at 07:18
It appears that some of the Ugaritic letters are represented by more than one Latin character (e.g. `ʾa` in your example). If so, the first step should be to replace these combinations with a single character (unused elsewhere). After that is done, the rest would be a trivial task: simply use the `translate()` function to replace each single character with its Ugaritic counterpart. — y.arazim, Aug 25 '23 at 07:21
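
A minimal sketch of the splitting idea from the comments above, assuming an XSLT 3.0 processor such as the Saxon bundled with oXygen. The function name `my:chars` and its namespace URI are made up for this illustration; only `string-to-codepoints()` and `codepoints-to-string()` are standard. Note that a letter written with a combining mark, such as `b̊`, splits into two code points (`b` plus U+030A), so a per-character lookup has to ignore or strip the combining mark.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="urn:example:my"
    exclude-result-prefixes="#all">

  <xsl:output method="xml" indent="yes"/>

  <!-- Hypothetical helper (not a built-in): one single-character string per code point -->
  <xsl:function name="my:chars" as="xs:string*">
    <xsl:param name="s" as="xs:string?"/>
    <xsl:sequence select="string-to-codepoints($s) ! codepoints-to-string(.)"/>
  </xsl:function>

  <!-- Entry point that needs no source document -->
  <xsl:template name="xsl:initial-template">
    <chars>
      <!-- 'b&#x30A;' is b followed by COMBINING RING ABOVE, as in <unclear>b̊</unclear> -->
      <xsl:for-each select="my:chars('b&#x30A;')">
        <!-- cp holds the decimal code point of the current character -->
        <char cp="{string-to-codepoints(.)}"/>
      </xsl:for-each>
    </chars>
  </xsl:template>

</xsl:stylesheet>
```

Run with the initial template as entry point (the `-it` option on the Saxon command line), this should produce two `char` elements, one for `b` and one for the combining ring, which shows why `b̊` does not behave like a single character in a lookup.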

score 3 · Answer 1 · answered Aug 25 '23 at 08:17

Using the transliteration table from here, I came up with the following code (requires XSLT 2.0):

<xsl:stylesheet version="2.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="ASCII" indent="yes"/>
<xsl:strip-space elements="*"/>

<xsl:variable name="latin">abgḫdhwzḥṭykšlmḏnẓspṣqrṯġtiuSʾ</xsl:variable>
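<xsl:variable name="ugaritic">&#x10380;&#x10381;&#x10382;&#x10383;&#x10384;&#x10385;&#x10386;&#x10387;&#x10388;&#x10389;&#x1038A;&#x1038B;&#x1038C;&#x1038D;&#x1038E;&#x1038F;&#x10390;&#x10391;&#x10392;&#x10393;&#x10394;&#x10395;&#x10396;&#x10397;&#x10398;&#x10399;&#x1039A;&#x1039B;&#x1039C;&#x1039D;</xsl:variable>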

<!-- identity transform -->
<xsl:template match="@*|node()">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
</xsl:template>

<xsl:template match="(w|g|name|seg)[text()]">
    <xsl:variable name="adjusted" select="replace(., 's2', 'S')" />
    <xsl:copy>
        <xsl:attribute name="correspUnic">
            <xsl:value-of select="translate($adjusted, $latin, $ugaritic)" />
        </xsl:attribute>
        <xsl:apply-templates/>
    </xsl:copy>
</xsl:template>

</xsl:stylesheet>

(I set the output encoding to ASCII just to be able to recognize the output characters).

Applying this to the following XML:

<root>
    <w>bn</w>
    <g>.</g>
    <name>qdš</name>
    <w>
      <seg>ʾa</seg>
      <unclear>b̊</unclear>
    </w>
</root>

I get:

<?xml version="1.0" encoding="ASCII"?>
<root>
   <w correspUnic="&#x10381;&#x10390;">bn</w>
   <g correspUnic=".">.</g>
   <name correspUnic="&#x10395;&#x10384;&#x1038c;">qd&#x161;</name>
   <w>
      <seg correspUnic="&#x1039d;&#x10380;">&#x2be;a</seg>
      <unclear>b&#x30a;</unclear>
   </w>
</root>

Apparently you have some more entries in your transliteration table, but that should be a very simple modification.

answered Aug 25 '23 at 08:17

y.arazim


Thank you very much! However, I use UTF-8 for all my other files for my current research. I am worried that it might be difficult to juggle UTF-8 and ASCII in the same project. Nonetheless, I am saving your approach, which may prove useful in another project. – Vanessa Aug 25 '23 at 13:45
As I said, I only set the encoding to ASCII so that I can recognize the output (I do not read Ugaritic). You can simply set `encoding="UTF-8"` and get the actual characters instead of numeric character references. Do note that in XML a decimal reference such as `&#66433;` or a hexadecimal one such as `&#x10381;` represents exactly the same thing as the literal character itself. Unless you are dealing with some non-conforming parser down the road, there is no reason to go to such great lengths to get the numeric references. – y.arazim Aug 25 '23 at 14:14
As far as I know, Oxygen does not display Ugaritic glyphs, so in order to check them manually -- on a few random lines -- I prefer to see the Unicode decimal values to make sure that the correct Ugaritic glyphs will be displayed in HTML. Anyway, thanks for your explanations @y-arazim, much appreciated. – Vanessa Aug 25 '23 at 14:22
So what does it display instead? What do you get if your input is `` and you do only the identity transform in XSLT? – y.arazim Aug 25 '23 at 14:51
If I add `` to my XML files, I have rectangular shapes (in Oxygen XML with MacOS). Of course, when I make a transformation to display in a browser, although I have rectangular shapes in Oxygen, I will have the glyphs correctly displayed in html in my browser. I have thousands of lines with Ugaritic lexemes, easier to read unicode than rectangular shapes which consequently give no information on the precise value of the unicode (or glyph). – Vanessa Aug 25 '23 at 15:07

I see. Well, you *could* use ASCII encoding for testing (as I did). You could also combine my method with Martin's: first translate the Latinized text to Ugaritic characters; then use a character map to force the characters to appear as numeric character references. – y.arazim Aug 25 '23 at 15:34
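
One possible way to combine the two approaches, as a sketch only and assuming an XSLT 2.0 processor. The table is deliberately cut down to the two letters `b` and `n` (code points 66433 and 66448, as quoted in the question), and the character map name `ugaritic-refs` is an arbitrary choice for this illustration.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Serialize each listed Ugaritic character as a decimal reference, even in UTF-8 output -->
  <xsl:character-map name="ugaritic-refs">
    <xsl:output-character character="&#66433;" string="&amp;#66433;"/>
    <xsl:output-character character="&#66448;" string="&amp;#66448;"/>
    <!-- one entry per remaining Ugaritic character -->
  </xsl:character-map>

  <xsl:output method="xml" encoding="UTF-8" use-character-maps="ugaritic-refs"/>

  <!-- Parallel strings: the n-th Latin character maps to the n-th Ugaritic character -->
  <xsl:variable name="latin" select="'bn'"/>
  <xsl:variable name="ugaritic" select="'&#66433;&#66448;'"/>

  <!-- identity transform -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Add @correspUnic to w elements that directly contain text -->
  <xsl:template match="w[text()]">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:attribute name="correspUnic" select="translate(., $latin, $ugaritic)"/>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

With `<w>bn</w>` as input this should serialize as `<w correspUnic="&#66433;&#66448;">bn</w>` while the rest of the file stays in UTF-8, because the character map is applied only when the result tree is written out.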

score 1 · Accepted Answer · answered Aug 25 '23 at 07:53

Perhaps the following helps, though I have not quite understood the whole lot of characters used:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="#all"
    version="3.0">

  <xsl:mode on-no-match="shallow-copy"/>
  
  <!-- Define Unicode data directly in the variable -->
  <xsl:param name="unicodeData">
      <data>
         <row>
            <latin>ʾa</latin>
            <Unicode>66432</Unicode>
         </row>
         <row>
            <latin>b</latin>
            <Unicode>66433</Unicode>
         </row>
         <row>
            <latin>n</latin>
            <Unicode>66448</Unicode>
         </row>
         <row>
            <latin>g</latin>
            <Unicode>66434</Unicode>
         </row>
         <row>
            <latin>ḫ</latin>
            <Unicode>66435</Unicode>
         </row>
         <row>
            <latin>d</latin>
            <Unicode>66436</Unicode>
         </row>
       <!-- etc -->
      </data>
  </xsl:param>
   
  <xsl:key name="latin-to-unicode" match="row" use="latin"/>
  
  <xsl:character-map name="ugaritic">
    <xsl:output-character character="&#66433;" string="&amp;#66433;"/>
    <xsl:output-character character="&#66448;" string="&amp;#66448;"/>
    <!-- ... -->
  </xsl:character-map>

  <xsl:output method="xml" use-character-maps="ugaritic"/>

  <xsl:template match="*[text()[normalize-space()]]">
    <xsl:copy>
      <xsl:attribute name="correspUnic">
        <xsl:apply-templates select="text()" mode="map"/>
      </xsl:attribute>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>
  
  <xsl:template match="text()" mode="map">
    <xsl:for-each select="string-to-codepoints(.) ! codepoints-to-string(.)">
      <xsl:sequence select="key('latin-to-unicode', ., $unicodeData)/Unicode => codepoints-to-string()"/>
    </xsl:for-each>
  </xsl:template>
  
</xsl:stylesheet>

Transforms <w>bn</w> into <w correspUnic="&#66433;&#66448;">bn</w>.

Thank you so so much Martin! It works great--I didn't know `map` `mode`. However there is a problem for letters beginning with alef ʾ like ʾa (66432), ʾi (66459), ʾu (66460). Just nothing. I guess it is because it is interpreted as two characters, but it is just one glyph. How can I get around this? — Vanessa, Aug 25 '23 at 13:35
The problem handling ʾa (66432), ʾi (66459), ʾu (66460) has been solved with regex! I will make an update to the final code. — Vanessa, Aug 25 '23 at 17:48

score 1 · Answer 3 · answered Aug 25 '23 at 18:07

Thanks to Martin, who helped me solve the problem of displaying the @correspUnic values. However, there was still a problem with the Unicode decimal values of ʾa (66432), ʾi (66459) and ʾu (66460): each was interpreted as two characters, although in Ugaritic it is a single glyph. To get around the problem, I used regex. Then I had to do some additional processing to replace `&amp;` with `&`, which wasn't very simple, given that & is de facto understood as preceding an entity. I'm not saying it is the best solution, but it works.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:xs="http://www.w3.org/2001/XMLSchema"
   exclude-result-prefixes="#all"
   version="3.0">
   
   <xsl:mode on-no-match="shallow-copy"/>
   
   <!-- Define Unicode data directly in the variable -->
   <xsl:param name="unicodeData">
      <data>
         <row>
            <latin>ʾa</latin>
            <Unicode>66432</Unicode>
         </row>
         <row>
            <latin>b</latin>
            <Unicode>66433</Unicode>
         </row>
         <row>
            <latin>g</latin>
            <Unicode>66434</Unicode>
         </row>
         <row>
            <latin>ḫ</latin>
            <Unicode>66435</Unicode>
         </row>
         <row>
            <latin>d</latin>
            <Unicode>66436</Unicode>
         </row>
         <row>
            <latin>h</latin>
            <Unicode>66437</Unicode>
         </row>
         <!-- etc -->
      </data>
   </xsl:param>
   
   <xsl:key name="latin-to-unicode" match="row" use="latin"/>
   
   <xsl:character-map name="ugaritic">
      <xsl:output-character character="&#66432;" string="&amp;#66432;"/>
      <xsl:output-character character="&#66433;" string="&amp;#66433;"/>
      <xsl:output-character character="&#66434;" string="&amp;#66434;"/>
      <xsl:output-character character="&#66435;" string="&amp;#66435;"/>
      <xsl:output-character character="&#66436;" string="&amp;#66436;"/>
      <xsl:output-character character="&#66437;" string="&amp;#66437;"/>
      <!-- etc -->
   </xsl:character-map>

   <xsl:output method="xml" use-character-maps="ugaritic"/>

   <!-- for example -->
   <!-- Apply correspUnic attribute only to w elements whose text does not come from child elements unclear, seg, supplied -->
   <xsl:template match="w[(not(child::unclear) and not(child::seg) and not(child::supplied)) and text() and (not(@correspUnic) or string-length(normalize-space(@correspUnic)) = 0)]">
      <xsl:copy>
         <xsl:apply-templates select="@*"/>
         <xsl:attribute name="correspUnic">
            <xsl:apply-templates select="text()" mode="map"/>
         </xsl:attribute>
         <xsl:apply-templates/>
      </xsl:copy>
   </xsl:template>

   <xsl:template match="text()" mode="map">
      <xsl:analyze-string select="." regex="ʾ[aiu]">
         <xsl:matching-substring>
            <xsl:variable name="matchedChar" select="." />
            <xsl:variable name="unicodeValue">
               <xsl:choose>
                  <xsl:when test="$matchedChar = 'ʾa'">66432</xsl:when>
                  <xsl:when test="$matchedChar = 'ʾi'">66459</xsl:when>
                  <xsl:when test="$matchedChar = 'ʾu'">66460</xsl:when>
               </xsl:choose>
            </xsl:variable>
            <!-- Create a Unicode string at once -->
            <xsl:variable name="unicodeString" select="codepoints-to-string($unicodeValue)"/>
            <!-- remove all &amp; -->
            <xsl:variable name="cleanedString" select="replace($unicodeString, '&amp;', '')"/>
            <xsl:sequence select="$cleanedString"/>
         </xsl:matching-substring>
         <xsl:non-matching-substring>
            <xsl:for-each select="string-to-codepoints(.) ! codepoints-to-string(.)">
               <xsl:sequence select="key('latin-to-unicode', ., $unicodeData)/Unicode => codepoints-to-string()"/>
            </xsl:for-each>
         </xsl:non-matching-substring>
      </xsl:analyze-string>
   </xsl:template>
   
   
</xsl:stylesheet>

IMHO you are making this much more complicated than it needs to be. You only need 3 steps: (1) convert any 2-character representation to a single character; (2) use the `translate()` function to convert the Latin characters to their corresponding Ugaritic characters, and (3) use a character map to force the Ugaritic characters to be output as numeric character entities. All that stuff with breaking the given string into individual characters and looking up their corresponding Unicode codepoint numbers from a table is completely unnecessary. — michael.hor257k, Aug 27 '23 at 11:00
Note also that when you have a 2-character representation like `ʾa` or `ʾi` or `ʾu` while no other characters are represented by the **single** characters `a` or `i` or `u`, then you can simply ignore the `ʾ` character (translate it out). This is all shown in @y.arazim's answer, with the exception of step #3. I don't know why you insist on ignoring it and doing it the hard way. — michael.hor257k, Aug 27 '23 at 11:00
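
The shortcut in the second comment can be sketched like this, assuming (as that comment does) that plain `a`, `i` and `u` never stand for any other letter in the corpus. The code points 66432, 66459 and 66460 are the ones quoted earlier in the thread; the ASCII output encoding is used here only so that the result shows up as numeric references.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" encoding="ASCII"/>

  <!-- Four source characters but only three targets:
       translate() deletes the unmatched fourth one (ʾ) -->
  <xsl:variable name="latin" select="'aiuʾ'"/>
  <xsl:variable name="ugaritic" select="'&#66432;&#66459;&#66460;'"/>

  <!-- identity transform -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Add @correspUnic to seg elements that directly contain text -->
  <xsl:template match="seg[text()]">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:attribute name="correspUnic" select="translate(., $latin, $ugaritic)"/>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

For `<seg>ʾa</seg>` the attribute should then hold the single character U+10380 (decimal 66432), with no regex and no lookup table involved. The remaining letters can be appended to both strings, as long as `ʾ` stays last in `$latin` with no counterpart in `$ugaritic`.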
