I am using XSLT (XSLT 2.0 is fine) to transform XML (TEI) to readable plaintext (with some minor modifications/challenges—preserving space for poetry; making titles all upper case).

So far everything is working as I'd like, but in the interests of readability I'd additionally like to limit the length of each line of text output by this transformation to some value (like 80 chars wide), splitting only on spaces (not breaking words apart, etc.). I want to set a maximum line length for the output (of, say, 80 chars), not just output the first 80 chars.

Does anyone have suggestions about the best approach? Is a template that matches all text() and then uses XSLT's built in string functions the way to go? I'm trying to imagine using string functions (string-length and substring or similar) to do this, but not having any luck yet.

(I could do this separately, using a python script, pretty easily, so perhaps "do it afterwards" may be the best answer. I'd love to know if I'm overlooking a simple solution though.)

cforster
  • Can you use XSLT 2.0? Also, please clarify whether you want to split the text into multiple lines, or output only the first 80 characters or fewer. – michael.hor257k Dec 05 '15 at 16:00
  • I *can* use XSLT 2.0; and I would like to split the text into multiple lines of a specified maximum length, *not* simply output the first 80 characters. – cforster Dec 05 '15 at 18:51
  • Splitting only on spaces would be relatively easy. However, it would not be a good solution. There are many other word-delimiting characters - but XPath regex has no anchor for a word boundary. Why do you have to do this at all? Why not leave it to the displaying application which is very likely to already have this feature? – michael.hor257k Dec 06 '15 at 00:06
  • Well, the *why* is because, for plaintext (or, say, markdown, which I might try to generate as well) the displaying application is uncertain, and possibly a text editor. There, long lines are ugly & unreadable. Project Gutenberg for instance splits at 70 chars in its text files. I've got a python script that splits lines longer than 80 chars, and it looks good. At this point, I think that trying to do it in pure XSLT isn't worth the bother. – cforster Dec 06 '15 at 04:12
  • @michael.hor257k, Re: "XPath regex has no anchor for a word boundary": Actually, XPath 2.0 **has** a complete set of character escapes and multi-character escapes, one of them being `\W`, which matches any character that does not match `\w`. And `\w` matches any character considered to form part of a word, as distinct from a separator between words; specifically, a character that does not match `\p{P}` or `\p{Z}` or `\p{C}`. And this can be used, as in my answer, for a complete and efficient solution. – Dimitre Novatchev Dec 06 '15 at 18:19
  • @DimitreNovatchev http://stackoverflow.com/questions/25446314/in-saxon-9-he-java-xml-parser-word-boundaries-b-in-regular-expressions-are-n/25464233#25464233 – michael.hor257k Dec 06 '15 at 18:54
  • @michael.hor257k, The lack of `\b` in the Regex language grammar of XSD doesn't mean at all that there isn't a way to denote a non-word character. `\W` matches exactly these. – Dimitre Novatchev Dec 06 '15 at 19:02
  • @DimitreNovatchev Denoting a non-word character and finding a word boundary are two separate things. Ideally (though I don't think the Java implementation does that), splitting on word boundaries should keep a comma together with the preceding word. Similarly, a quoted word should keep the quotes from both sides. There are many other pitfalls that a naive implementation such as the ones suggested here will fall into. To give just another example, a line should not end with `".. something. A"`. – michael.hor257k Dec 07 '15 at 09:06

1 Answer

I. Here is a solution I wrote more than 10 years ago.

This transformation (from the FXSL library):

<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:f="http://fxsl.sf.net/"
 xmlns:str-split2lines-func="f:str-split2lines-func"
 exclude-result-prefixes="f str-split2lines-func">

   <xsl:import href="str-foldl.xsl"/>
   <xsl:output method="text"/>

   <str-split2lines-func:str-split2lines-func/>

    <xsl:template match="/">
      <xsl:call-template name="str-split-to-lines">
        <xsl:with-param name="pStr" select="/*"/>
        <xsl:with-param name="pLineLength" select="64"/>
        <xsl:with-param name="pDelimiters" select="' &#9;&#10;&#13;'"/>
      </xsl:call-template>
    </xsl:template>

    <xsl:template name="str-split-to-lines">
      <xsl:param name="pStr"/>
      <xsl:param name="pLineLength" select="60"/>
      <xsl:param name="pDelimiters" select="' &#9;&#10;&#13;'"/>

      <xsl:variable name="vsplit2linesFun"
                    select="document('')/*/str-split2lines-func:*[1]"/>

      <xsl:variable name="vrtfParams">
       <delimiters><xsl:value-of select="$pDelimiters"/></delimiters>
       <lineLength><xsl:copy-of select="$pLineLength"/></lineLength>
      </xsl:variable>

      <xsl:variable name="vResult">
          <xsl:call-template name="str-foldl">
            <xsl:with-param name="pFunc" select="$vsplit2linesFun"/>
            <xsl:with-param name="pStr" select="$pStr"/>
            <xsl:with-param name="pA0" select="$vrtfParams"/>
          </xsl:call-template>
      </xsl:variable>

      <xsl:for-each select="$vResult/line">
        <xsl:for-each select="word">
          <xsl:value-of select="concat(., ' ')"/>
        </xsl:for-each>
        <xsl:value-of select="'&#10;'"/>
      </xsl:for-each>
    </xsl:template>

    <xsl:template match="str-split2lines-func:*" mode="f:FXSL">
      <xsl:param name="arg1" select="/.."/>
      <xsl:param name="arg2"/>

      <xsl:copy-of select="$arg1/*[position() &lt; 3]"/>
      <xsl:copy-of select="$arg1/line[position() != last()]"/>

      <xsl:choose>
        <xsl:when test="contains($arg1/*[1], $arg2)">
          <xsl:if test="string($arg1/word)">
             <xsl:call-template name="fillLine">
               <xsl:with-param name="pLine" select="$arg1/line[last()]"/>
               <xsl:with-param name="pWord" select="$arg1/word"/>
               <xsl:with-param name="pLineLength" select="$arg1/*[2]"/>
             </xsl:call-template>
          </xsl:if>
        </xsl:when>
        <xsl:otherwise>
          <xsl:copy-of select="$arg1/line[last()]"/>
          <word><xsl:value-of select="concat($arg1/word, $arg2)"/></word>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:template>

      <!-- Test if the new word fits into the last line -->
    <xsl:template name="fillLine">
      <xsl:param name="pLine" select="/.."/>
      <xsl:param name="pWord" select="/.."/>
      <xsl:param name="pLineLength" />

      <xsl:variable name="vnWordsInLine" select="count($pLine/word)"/>
      <xsl:variable name="vLineLength" select="string-length($pLine) + $vnWordsInLine"/>
      <xsl:choose>
        <xsl:when test="not($vLineLength + string-length($pWord) > $pLineLength)">
          <line>
            <xsl:copy-of select="$pLine/*"/>
            <xsl:copy-of select="$pWord"/>
          </line>
        </xsl:when>
        <xsl:otherwise>
          <xsl:copy-of select="$pLine"/>
          <line>
            <xsl:copy-of select="$pWord"/>
          </line>
          <word/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:template>

</xsl:stylesheet>

when applied to the following XML document:

<text>
Dec. 13 — As always for a presidential inaugural, security and surveillance were
extremely tight in Washington, DC, last January. But as George W. Bush prepared to
take the oath of office, security planners installed an extra layer of protection: a
prototype software system to detect a biological attack. The U.S. Department of
Defense, together with regional health and emergency-planning agencies, distributed
a special patient-query sheet to military clinics, civilian hospitals and even aid
stations along the parade route and at the inaugural balls. Software quickly
analyzed complaints of seven key symptoms — from rashes to sore throats — for
patterns that might indicate the early stages of a bio-attack. There was a brief
scare: the system noticed a surge in flulike symptoms at military clinics.
Thankfully, tests confirmed it was just that — the flu.
</text>

wraps the text into lines of at most 64 characters (any length can be specified as the value of the parameter $pLineLength), and the result is:

Dec. 13 — As always for a presidential inaugural, security and 
surveillance were extremely tight in Washington, DC, last 
January. But as George W. Bush prepared to take the oath of 
office, security planners installed an extra layer of 
protection: a prototype software system to detect a biological 
attack. The U.S. Department of Defense, together with regional 
health and emergency-planning agencies, distributed a special 
patient-query sheet to military clinics, civilian hospitals and 
even aid stations along the parade route and at the inaugural 
balls. Software quickly analyzed complaints of seven key 
symptoms — from rashes to sore throats — for patterns that might 
indicate the early stages of a bio-attack. There was a brief 
scare: the system noticed a surge in flulike symptoms at 
military clinics. Thankfully, tests confirmed it was just that — 
the flu. 

The separate stylesheet imported by the above transformation is:

str-foldl.xsl:


<xsl:stylesheet version="2.0" 
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:f="http://fxsl.sf.net/"
 exclude-result-prefixes="f">
    <xsl:template name="str-foldl">
      <xsl:param name="pFunc" select="/.."/>
      <xsl:param name="pA0"/>
      <xsl:param name="pStr"/>

      <xsl:choose>
         <xsl:when test="not(string($pStr))">
            <xsl:copy-of select="$pA0"/>
         </xsl:when>
         <xsl:otherwise>
            <xsl:variable name="vFunResult">
              <xsl:apply-templates select="$pFunc[1]" mode="f:FXSL">
                <xsl:with-param name="arg0" select="$pFunc[position() > 1]"/>
                <xsl:with-param name="arg1" select="$pA0"/>
                <xsl:with-param name="arg2" select="substring($pStr,1,1)"/>
              </xsl:apply-templates>
            </xsl:variable>

            <xsl:call-template name="str-foldl">
              <xsl:with-param name="pFunc" select="$pFunc"/>
              <xsl:with-param name="pStr" select="substring($pStr,2)"/>
              <xsl:with-param name="pA0" select="$vFunResult"/>
            </xsl:call-template>
         </xsl:otherwise>
      </xsl:choose>

    </xsl:template>
</xsl:stylesheet>

Do note that this is essentially an XSLT 1.0 solution. A shorter solution is possible using the regular expression support in XSLT 2.0.


II. Using XSLT 2.0 Regex

Here is a function, f:getLine(), that takes a string and a maximum line length and returns the first line of that string: the longest initial substring of the first maximum-line-length characters that ends on a word boundary. The transformation below uses this function to produce the first line of the wanted multi-line result.

<xsl:stylesheet version="2.0"  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:f="my:f" xmlns:xs="http://www.w3.org/2001/XMLSchema">
 <xsl:output method="text"/>

  <xsl:template match="/*/text()">
    <xsl:sequence select="f:getLine(., 64)"/>
  </xsl:template>

  <xsl:function name="f:getLine" as="xs:string?">
    <xsl:param name="pText" as="xs:string?"/>
    <xsl:param name="pLength" as="xs:integer"/>

    <xsl:variable name="vChunk" select="substring($pText, 1, $pLength)"/>

    <xsl:choose>
      <xsl:when test="not(string-length($pText) > $pLength) 
                      or matches(substring($pText, $pLength+1, 1), '\W')">
        <xsl:sequence select="$vChunk"/>
      </xsl:when>
      <xsl:otherwise>
            <xsl:analyze-string select="$vChunk" 
                 regex="^((\W*\w*)*?)(\W+\w*)$">
              <xsl:matching-substring>
                <xsl:sequence select="regex-group(1)"/>
              </xsl:matching-substring>
            </xsl:analyze-string>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:function>
</xsl:stylesheet>

When this transformation is applied on the same XML document, the correct first line is produced:

Dec. 13 — As always for a presidential inaugural, security and

Finally, the complete XSLT 2.0 transformation with RegEx:

<xsl:stylesheet version="2.0"  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:f="my:f" xmlns:xs="http://www.w3.org/2001/XMLSchema">
 <xsl:output method="text"/>

  <xsl:template match="/*/text()" name="reformat">
    <xsl:param name="pText" select="translate(., '&#xA;', ' ')"/>
    <xsl:param name="pMaxLength" select="64"/>
    <xsl:param name="pTotalLength" select="string-length(.)"/>
    <xsl:param name="pLengthFormatted" select="0"/>

    <xsl:if test="not($pLengthFormatted >= $pTotalLength)">
        <xsl:variable name="vNextLine" 
         select="f:getLine(substring($pText, $pLengthFormatted+1), $pMaxLength)"/>
        <xsl:sequence select="concat($vNextLine, '&#xA;')"/>

        <xsl:call-template name="reformat">
          <xsl:with-param name="pText" select="$pText"/>
          <xsl:with-param name="pMaxLength" select="$pMaxLength"/>
          <xsl:with-param name="pTotalLength" select="$pTotalLength"/>
          <xsl:with-param name="pLengthFormatted" 
                    select="$pLengthFormatted + string-length($vNextLine)"/>
        </xsl:call-template>
    </xsl:if>
  </xsl:template>

  <xsl:function name="f:getLine" as="xs:string?">
    <xsl:param name="pText" as="xs:string?"/>
    <xsl:param name="pLength" as="xs:integer"/>

    <xsl:variable name="vChunk" select="substring($pText, 1, $pLength)"/>

    <xsl:choose>
      <xsl:when test="not(string-length($pText) > $pLength) 
                      or matches(substring($pText, $pLength+1, 1), '\W')">
        <xsl:sequence select="$vChunk"/>
      </xsl:when>
      <xsl:otherwise>
            <xsl:analyze-string select="$vChunk" 
                 regex="^((\W*\w*)*?)(\W+\w*)$">
              <xsl:matching-substring>
                <xsl:sequence select="regex-group(1)"/>
              </xsl:matching-substring>
            </xsl:analyze-string>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:function>
</xsl:stylesheet>
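
For comparison, here is a minimal sketch of a third variant that splits only on whitespace, which is what the question literally asks for. It is not part of FXSL or of the transformations above: the w: namespace, the function name w:wrap and the stylesheet parameter pMaxLength are names assumed for illustration. The idea is to tokenize the normalized text into words and fill each line greedily, using the same "does the next word still fit" test as the fillLine template in part I.

```xslt
<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:xs="http://www.w3.org/2001/XMLSchema"
 xmlns:w="my:wrap" exclude-result-prefixes="xs w">
 <xsl:output method="text"/>

  <!-- Assumed parameter name; 64 matches the examples above -->
  <xsl:param name="pMaxLength" as="xs:integer" select="64"/>

  <xsl:template match="/*/text()">
    <!-- Collapse all whitespace runs to single spaces, split into words,
         and emit the wrapped lines separated by newlines -->
    <xsl:value-of
         select="w:wrap(tokenize(normalize-space(.), ' '), '', $pMaxLength)"
         separator="&#xA;"/>
    <xsl:text>&#xA;</xsl:text>
  </xsl:template>

  <!-- Greedy fill. $pWords: words not yet placed; $pLine: the line being built.
       Returns the finished lines as a sequence of strings. -->
  <xsl:function name="w:wrap" as="xs:string*">
    <xsl:param name="pWords" as="xs:string*"/>
    <xsl:param name="pLine" as="xs:string"/>
    <xsl:param name="pMax" as="xs:integer"/>

    <xsl:choose>
      <!-- No words left: flush the current line unless it is empty -->
      <xsl:when test="empty($pWords)">
        <xsl:sequence select="$pLine[. ne '']"/>
      </xsl:when>
      <!-- Empty line: always take the next word, even if it is longer than $pMax -->
      <xsl:when test="$pLine eq ''">
        <xsl:sequence select="w:wrap(subsequence($pWords, 2), $pWords[1], $pMax)"/>
      </xsl:when>
      <!-- The next word, plus one separating space, still fits -->
      <xsl:when test="string-length($pLine) + 1 + string-length($pWords[1]) le $pMax">
        <xsl:sequence
             select="w:wrap(subsequence($pWords, 2), concat($pLine, ' ', $pWords[1]), $pMax)"/>
      </xsl:when>
      <!-- It does not fit: emit the current line and start a new one -->
      <xsl:otherwise>
        <xsl:sequence select="$pLine, w:wrap($pWords, '', $pMax)"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:function>
</xsl:stylesheet>
```

Two design choices to be aware of: a word longer than the maximum is placed on a line of its own rather than broken, so the recursion always consumes at least one word per line and cannot stall; and because normalize-space() collapses all whitespace, this is unsuitable for text whose spacing must be preserved (such as the poetry mentioned in the question). The recursion depth grows with the number of words, so very long text nodes may need a processor that optimizes recursive calls.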
Dimitre Novatchev