36

I'm trying to convert an XML file into the markup used by dokuwiki, using XSLT. This actually works to some degree, but the indentation in the XSL file is getting inserted into the results. At the moment, I have two choices: abandon this XSLT thing entirely, and find another way to convert from XML to dokuwiki markup, or delete about 95% of the whitespace from the XSL file, making it nigh-unreadable and a maintenance nightmare.

Is there some way to keep the indentation in the XSL file without passing all that whitespace on to the final document?

Background: I'm migrating an autodoc tool from static HTML pages over to dokuwiki, so the API developed by the server team can be further documented by the applications team whenever the apps team runs into poorly-documented code. The logic is to have a section of each page set aside for the autodoc tool, and to allow comments anywhere outside this block. I'm using XSLT because we already have the XSL file to convert from XML to XHTML, and I'm assuming it will be faster to rewrite the XSL than to roll my own solution from scratch.

Edit: Ah, right, foolish me, I neglected the indent attribute. (Other background note: I am new to XSLT.) On the other hand, I still have to deal with newlines. Dokuwiki uses pipes to differentiate between table columns, which means that all of the data in a table row must be on one line. Is there a way to suppress newlines in the output (just occasionally), so I can do some fairly complex logic for each table cell in a somewhat readable fashion?

Mathias Müller
PotatoEngineer

4 Answers

77

There are three reasons for getting unwanted whitespace in the result of an XSLT transformation:

  1. whitespace that comes from between nodes in the source document
  2. whitespace that comes from within nodes in the source document
  3. whitespace that comes from the stylesheet

I'm going to talk about all three because it can be hard to tell where whitespace comes from, so you might need to use several strategies.

To address the whitespace that is between nodes in your source document, you should use <xsl:strip-space> to strip out any whitespace that appears between two nodes, and then use <xsl:preserve-space> to preserve the significant whitespace that might appear within mixed content. For example, if your source document looks like:

<ul>
  <li>This is an <strong>important</strong> <em>point</em></li>
</ul>

then you will want to ignore the whitespace between the <ul> and the <li> and between the </li> and the </ul>, which is not significant, but preserve the whitespace between the <strong> and <em> elements, which is significant (otherwise you'd get "This is an **important***point*"). To do this use

<xsl:strip-space elements="*" />
<xsl:preserve-space elements="li" />

The elements attribute on <xsl:preserve-space> should basically list all the elements in your document that have mixed content.

Aside: using <xsl:strip-space> also reduces the size of the source tree in memory, and makes your stylesheet more efficient, so it's worth doing even if you don't have whitespace problems of this sort.
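As a rough sketch of where these declarations go, the stylesheet below puts both at the top level and turns the list above into dokuwiki markup. The templates for strong, em and li, and the choice of `**`, `//` and `  * ` as target markup, are assumptions made for illustration; they are not taken from the question.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Drop whitespace-only text nodes everywhere in the source tree... -->
  <xsl:strip-space elements="*"/>
  <!-- ...except inside elements with mixed content. -->
  <xsl:preserve-space elements="li"/>

  <!-- Assumed mapping: one list item per output line. -->
  <xsl:template match="li">
    <xsl:text>  * </xsl:text>
    <xsl:apply-templates/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

  <!-- Assumed mapping: bold and italic markers around inline content. -->
  <xsl:template match="strong">
    <xsl:text>**</xsl:text>
    <xsl:apply-templates/>
    <xsl:text>**</xsl:text>
  </xsl:template>

  <xsl:template match="em">
    <xsl:text>//</xsl:text>
    <xsl:apply-templates/>
    <xsl:text>//</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

Applied to the <ul> above, this should give the single line "  * This is an **important** //point//", with the space between the two inline elements kept and the indentation around the <li> dropped.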

To address the whitespace that appears within nodes in your source document, you should use normalize-space(). For example, if you have:

<dt>
  a definition
</dt>

and you can be sure that the <dt> element won't hold any elements that you want to do something with, then you can do:

<xsl:template match="dt">
  ...
  <xsl:value-of select="normalize-space(.)" />
  ...
</xsl:template>

The leading and trailing whitespace will be stripped from the value of the <dt> element (and any runs of whitespace inside it collapsed to a single space), so you will just get the string "a definition".

Whitespace coming from the stylesheet, which is perhaps what you're experiencing, arises when you have text within a template like this:

<xsl:template match="name">
  Name:
  <xsl:value-of select="." />
</xsl:template>

XSLT stylesheets are parsed in the same way as the source documents that they process, so the above XSLT is interpreted as a tree that holds an <xsl:template> element with a match attribute whose first child is a text node and whose second child is a <xsl:value-of> element with a select attribute. The text node has leading and trailing whitespace (including line breaks); since it's literal text in the stylesheet, it gets literally copied over into the result, with all the leading and trailing whitespace.

But some whitespace in XSLT stylesheets gets stripped automatically, namely whitespace-only text nodes between elements. So you don't get a line break in your result just because there's a line break between the <xsl:value-of> and the close of the <xsl:template>.

To get only the text you want in the result, use the <xsl:text> element like this:

<xsl:template match="name">
  <xsl:text>Name: </xsl:text>
  <xsl:value-of select="." />
</xsl:template>

The XSLT processor will ignore the line breaks and indentation that appear between nodes, and only output the text within the <xsl:text> element.
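For the table case in the question, the pieces above can be combined. The following is only a sketch: "row" and "cell" are hypothetical element names standing in for whatever the real source document uses, and it assumes each cell can be flattened to plain text.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <!-- "row" and "cell" are hypothetical source element names. -->
  <xsl:template match="row">
    <xsl:for-each select="cell">
      <!-- Only the text inside xsl:text is output, not the indentation. -->
      <xsl:text>| </xsl:text>
      <!-- Collapse line breaks and indentation inside the cell. -->
      <xsl:value-of select="normalize-space(.)"/>
      <xsl:text> </xsl:text>
    </xsl:for-each>
    <!-- Close the row and end the line exactly once. -->
    <xsl:text>|&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

Each row then comes out on one line in the form "| first | second |", however the template itself is indented. A cell whose text contains a literal pipe would still need separate handling.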

JeniT
  • that was indeed helpful, but I'm puzzled by your use of the phrase "between nodes". Isn't it true that all whitespace is contained in text nodes? What do you mean by "between nodes"? If I hadn't recognized your name I would have assumed you needed a lecture on XML document structure. – LarsH Sep 05 '10 at 01:58
  • Good article, thanks! But strictly speaking, you're using the term 'node' where you actually mean 'element'. – rustyx Jan 05 '11 at 18:50
  • @LarsH: I'm outside of my domain here (and a few months late), but I think this answers your question: http://www.w3.org/TR/xslt#strip "...some text nodes are stripped. A text node is never stripped unless it contains only whitespace characters." "A text node is preserved if ... the text node contains at least one non-whitespace character." – Dan Jan 16 '11 at 04:08
4

Are you using indent="no" in your output tag?

<xsl:output method="text" indent="no" />

Also if you're using xsl:value-of you can use the disable-output-escaping="yes" to help with some whitespace issues.

Lindsay
  • 4
    Most of the time, using `disable-output-escaping` is the wrong way to do things. It's only there for very restricted situations. Advocating d-o-e in such a general way to someone who doesn't know better is probably more harmful than helpful. See http://www.dpawson.co.uk/xsl/sect2/N2215.html#d3702e223 – LarsH Sep 05 '10 at 01:51
3

@JeniT's answer is great, I just want to point out a trick for managing whitespace. I'm not certain it's the best way (or even a good way), but it works for me for now.

("s" for space, "e" for empty, "n" for newline.)

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xsl:transform [
  <!ENTITY s "<xsl:text xmlns:xsl='http://www.w3.org/1999/XSL/Transform'> </xsl:text>" >
  <!ENTITY s2 "<xsl:text xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>  </xsl:text>" >
  <!ENTITY s4 "<xsl:text xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>    </xsl:text>" >
  <!ENTITY s6 "<xsl:text xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>      </xsl:text>" >
  <!ENTITY e "<xsl:text xmlns:xsl='http://www.w3.org/1999/XSL/Transform'></xsl:text>" >
  <!ENTITY n "<xsl:text xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
</xsl:text>" >
]>

<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xsl:output method="text"/>
<xsl:template match="/">
  &e;Flush left, despite the indentation.&n;
  &e;  This line will be output indented two spaces.&n;

      <!-- the blank lines above/below won't be output -->

  <xsl:for-each select="//foo">
    &e;  Starts with two blanks: <xsl:value-of select="@bar"/>.&n;
    &e;  <xsl:value-of select="@baz"/> The 'e' trick won't work here.&n;
    &s2;<xsl:value-of select="@baz"/> Use s2 instead.&n;
    &s2;    <xsl:value-of select="@abc"/>    <xsl:value-of select="@xyz"/>&n;
    &s2;    <xsl:value-of select="@abc"/>&s;<xsl:value-of select="@xyz"/>&n;
  </xsl:for-each>
</xsl:template>
</xsl:transform>

Applied to:

<?xml version="1.0" encoding="UTF-8"?>
<foo bar="bar" baz="baz" abc="abc" xyz="xyz"></foo>

Outputs:

Flush left, despite the indentation.
  This line will be output indented two spaces.
  Starts with two blanks: bar.
baz The 'e' trick won't work here.
  baz Use s2 instead.
  abcxyz
  abc xyz

The 'e' trick works prior to a text node containing at least one non-whitespace character because it expands to this:

<xsl:template match="/">
  <xsl:text></xsl:text>Flush left, despite the indentation.<xsl:text>
</xsl:text>

Since the rules for stripping whitespace say that whitespace-only text nodes get stripped, the newline and indentation between the <xsl:template> and <xsl:text> get stripped (good). Since the rules say a text node with at least one non-whitespace character is preserved, the implicit text node containing "  This line will be output indented two spaces." keeps its leading whitespace (but I guess this also depends on the settings for strip/preserve/normalize). The "&n;" at the end of the line inserts a newline, but it also ensures that any following whitespace is ignored, because it appears between two nodes.

The trouble I have is when I want to output an indented line that begins with an <xsl:value-of>. In that case, the "&e;" won't help, because the indentation whitespace isn't "attached" to any non-whitespace characters. So for those cases, I use "&s2;" or "&s4;", depending on how much indentation I want.

It's an ugly hack I'm sure, but at least I don't have the verbose "<xsl:text>" tags littering my XSLT, and at least I can still indent the XSLT itself so it's legible. I feel like I'm abusing XSLT for something it was not designed for (text processing) and this is the best I can do.


Edit: In response to comments, this is what it looks like without the "macros":

<xsl:template match="/">
  <xsl:text>Flush left, despite the indentation.
</xsl:text>
  <xsl:text>  This line will be output indented two spaces.
</xsl:text>
  <xsl:for-each select="//foo">
    <xsl:text>  Starts with two blanks: </xsl:text><xsl:value-of select="@bar"/>.<xsl:text>
</xsl:text>
    <xsl:text>    </xsl:text><xsl:value-of select="@abc"/><xsl:text> </xsl:text><xsl:value-of select="@xyz"/><xsl:text>
</xsl:text>
  </xsl:for-each>
</xsl:template>

I think that makes it less clear to see the intended output indentation, and it screws up the indentation of the XSL itself because the </xsl:text> end tags have to appear at column 1 of the XSL file (otherwise you get undesired whitespace in the output file).
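One alternative worth comparing, sketched here rather than recommended outright, is to write the newline as the character reference &#10; inside <xsl:text> (or inside a concat() call), so that no end tag has to sit at column 1. This relies on the XML parser resolving the reference to a real line feed before the XSLT processor sees it.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- &#10; becomes a line feed; the stylesheet's own line breaks are not output. -->
    <xsl:text>Flush left, despite the indentation.&#10;</xsl:text>
    <xsl:text>  This line will be output indented two spaces.&#10;</xsl:text>
    <xsl:for-each select="//foo">
      <xsl:text>  Starts with two blanks: </xsl:text>
      <xsl:value-of select="@bar"/>
      <xsl:text>.&#10;</xsl:text>
      <!-- concat() keeps indentation, values and the newline in one expression. -->
      <xsl:value-of select="concat('    ', @abc, ' ', @xyz, '&#10;')"/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Whether this reads better than the entity "macros" is a matter of taste; it is more verbose per line, but the XSL can be indented freely.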

Dan
  • @Dan: First, `xsl:text` is not verbose, and you can always use `concat` in `xsl:value-of`. Second, you are not processing text; your output is plain text. –  Jan 17 '11 at 16:10
  • @Dan: Last, your solution works against XSLT because those entities (properly declared) are part of the surface syntax of the XML document (the stylesheet, in this case). So the replacement takes place in the parsing phase, before reaching the XSLT processor. Once the replacement has been performed and there are **new elements** in the stylesheet, the rules for stripping/preserving whitespace-only text nodes are applied. From a reader's point of view, it won't be clear what your stylesheet's result will be. –  Jan 17 '11 at 16:12
  • @Alejandro: thanks for the feedback. I suppose it's not verbose if you're already accustomed to XML... my background is more lex/yacc/C++ so I'm definitely feeling out of my element here. I suppose using an XML editor vs. a text editor might help. – Dan Jan 17 '11 at 17:39
  • @Alejandro: regarding whether it's clear or not... I guess that's a matter of opinion. Either using `xsl:text` or the `&e;` type "macros" is better than what was proposed as an alternative in the question: "delete about 95% of the whitespace from the XSL file, making it nigh-unreadable and a maintenance nightmare." – Dan Jan 17 '11 at 17:41
  • @Dan: What shows that it's not a matter of opinion is the need of `&s2;` instead of `&e;` in some cases, for the same effect. –  Jan 17 '11 at 18:35
  • @Alejandro: yes, I'm not completely satisfied with that. The `&n;` I think is great though -- as far as I can tell, any alternative will either screw up your XSL indentation or screw up the whitespace in the output. See my edit -- not using the "macros" looks nasty. – Dan Jan 17 '11 at 18:47
0

Regarding your edit about new lines, you can use this template to recursively replace one string within another string, and you can use it for line breaks:

<xsl:template name="replace.string.section">
  <xsl:param name="in.string"/>
  <xsl:param name="in.characters"/>
  <xsl:param name="out.characters"/>
  <xsl:choose>
    <xsl:when test="contains($in.string,$in.characters)">
      <xsl:value-of select="concat(substring-before($in.string,$in.characters),$out.characters)"/>
      <xsl:call-template name="replace.string.section">
        <xsl:with-param name="in.string" select="substring-after($in.string,$in.characters)"/>
        <xsl:with-param name="in.characters" select="$in.characters"/>
        <xsl:with-param name="out.characters" select="$out.characters"/>
      </xsl:call-template>
    </xsl:when>
    <xsl:otherwise>
      <xsl:value-of select="$in.string"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template> 

Call it as follows (this example replaces line breaks in the $some.string variable with a space):

    <xsl:call-template name="replace.string.section">
        <xsl:with-param name="in.string" select="$some.string"/>
        <xsl:with-param name="in.characters" select="'&#xA;'"/>
        <xsl:with-param name="out.characters" select="' '"/>
    </xsl:call-template>
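
When the thing being replaced is a single character, as with a line break, the built-in translate() function may be enough on its own. The sketch below assumes the text to clean up is simply the string value of the whole source document; $some.string is the same placeholder variable name used above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Assumption: the string to clean up is the document's string value. -->
    <xsl:variable name="some.string" select="string(.)"/>
    <!-- translate() maps each line feed to a space, character by character. -->
    <xsl:value-of select="translate($some.string, '&#xA;', ' ')"/>
  </xsl:template>
</xsl:stylesheet>
```

The recursive template is still the one to use when either string is longer than one character, since translate() only maps single characters to single characters.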
Odilon Redo