In OOXML, formatting such as bold, italic, etc. can be (and often annoyingly is) split up between multiple elements, like so:

<w:p>
    <w:r>
        <w:rPr>
            <w:b/>
        </w:rPr>
        <w:t xml:space="preserve">This is a </w:t>
    </w:r>
    <w:r>
        <w:rPr>
            <w:b/>
        </w:rPr>
        <w:t xml:space="preserve">bold </w:t>
    </w:r>
    <w:r>
        <w:rPr>
            <w:b/>
            <w:i/>
        </w:rPr>
        <w:t>with a bit of italic</w:t>
    </w:r>
    <w:r>
        <w:rPr>
            <w:b/>
        </w:rPr>
        <w:t xml:space="preserve"> </w:t>
    </w:r>
    <w:r>
        <w:rPr>
            <w:b/>
        </w:rPr>
        <w:t>paragr</w:t>
    </w:r>
    <w:r>
        <w:rPr>
            <w:b/>
        </w:rPr>
        <w:t>a</w:t>
    </w:r>
    <w:r>
        <w:rPr>
            <w:b/>
        </w:rPr>
        <w:t>ph</w:t>
    </w:r>
    <w:r>
        <w:t xml:space="preserve"> with some non-bold in it too.</w:t>
    </w:r>
</w:p>

I need to combine these formatting elements to produce this:

<p><b>This is a bold <i>with a bit of italic</i> paragraph</b> with some non-bold in it too.</p>

My initial approach was going to be to write out the start formatting tag when it is first encountered using:

 <xsl:text disable-output-escaping="yes">&lt;b&gt;</xsl:text>

And then after I process each <w:r>, check the next one to see if the formatting is still present. If it's not, add the end tag in the same way I add the start tag. I keep thinking there must be a better way to do this, and I'd be grateful for any suggestions.

Should also mention that I am tied to XSLT 1.0.

The reason for needing this is that we need to compare an XML file before it is transformed into OOXML with the version we get after it is transformed back out of OOXML. The extra formatting tags make it appear as though changes were made when they were not.

Jacqueline
  • It's not clear to me how (and why) you are going to use `xsl:text`. Is your goal just to convert the OOXML you have shown into HTML? – Emiliano Poggi Jun 09 '11 at 19:38
  • @empo - I was going to write out the start tag using `xsl:text` where the formatting begins, and write out the end tag in the same manner. In the above example, I would write out the start `<b>` tag before processing the string "This is a", and I'd write out the `</b>` after processing the string "ph". – Jacqueline Jun 09 '11 at 20:16
  • @Jacqueline - by trying to disable output escaping and write out markup instead of elements, you're fighting XSLT's modus operandi and asking for a big headache. Though I agree doing it "right" is not trivial. – LarsH Jun 10 '11 at 04:00
  • @LarsH - I couldn't agree more. – Jacqueline Jun 10 '11 at 16:22
  • @Jacqueline: It seems to me that there are other, easier ways to perform the comparison, without producing the wanted XML. Would you accept such a solution? – Dimitre Novatchev Jun 15 '11 at 04:09
  • @Dimitre: Unfortunately we do need to produce the XML as well - it gets used by others. – Jacqueline Jun 15 '11 at 12:57
  • @Jacqueline: You only need this difficult procedure for the comparison. Others could use the non-merged xml as it is equivalent to the merged one. Any arguments against this? – Dimitre Novatchev Jun 15 '11 at 13:43
  • @Dimitre: I like your thinking, but unfortunately the file gets sent to another party who will also do a compare against the pre-OOXML version. Sorry... :( – Jacqueline Jun 15 '11 at 14:49
  • @Jacqueline: NP. I may enter a solution these days, so no need to accept an answer before the deadline. :) – Dimitre Novatchev Jun 15 '11 at 15:57
  • @Dimitre: I will look forward to it! – Jacqueline Jun 15 '11 at 18:06
  • @Jacqueline: Excellent question, +1. Please see my answer for a complete and generic (no hardcoded element names) XSLT 1.0 solution and its explanation. :) – Dimitre Novatchev Jun 16 '11 at 04:29

4 Answers

Here is a complete XSLT 1.0 solution:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:ext="http://exslt.org/common" xmlns:w="w"
 exclude-result-prefixes="ext w">
 <xsl:output omit-xml-declaration="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="w:p">
  <xsl:variable name="vrtfPass1">
   <p>
    <xsl:apply-templates/>
   </p>
  </xsl:variable>

  <xsl:apply-templates mode="pass2"
   select="ext:node-set($vrtfPass1)/*"/>
 </xsl:template>

 <xsl:template match="w:r">
  <xsl:variable name="vrtfProps">
   <xsl:for-each select="w:rPr/*">
    <xsl:sort select="local-name()"/>
    <xsl:copy-of select="."/>
   </xsl:for-each>
  </xsl:variable>

  <xsl:call-template name="toHtml">
   <xsl:with-param name="pProps" select=
       "ext:node-set($vrtfProps)/*"/>
   <xsl:with-param name="pText" select="w:t/text()"/>
  </xsl:call-template>
 </xsl:template>

 <xsl:template name="toHtml">
  <xsl:param name="pProps"/>
  <xsl:param name="pText"/>

  <xsl:choose>
   <xsl:when test="not($pProps)">
     <xsl:copy-of select="$pText"/>
   </xsl:when>
   <xsl:otherwise>
    <xsl:element name="{local-name($pProps[1])}">
      <xsl:call-template name="toHtml">
        <xsl:with-param name="pProps" select=
            "$pProps[position()>1]"/>
        <xsl:with-param name="pText" select="$pText"/>
      </xsl:call-template>
    </xsl:element>
   </xsl:otherwise>
  </xsl:choose>
 </xsl:template>

  <xsl:template match="/*" mode="pass2">
  <xsl:copy>
    <xsl:copy-of select="@*"/>
    <xsl:call-template name="processInner">
     <xsl:with-param name="pNodes" select="node()"/>
    </xsl:call-template>
  </xsl:copy>
 </xsl:template>

 <xsl:template name="processInner">
  <xsl:param name="pNodes"/>

  <xsl:variable name="pNode1" select="$pNodes[1]"/>

  <xsl:if test="$pNode1">
   <xsl:choose>
    <xsl:when test="not($pNode1/self::*)">
     <xsl:copy-of select="$pNode1"/>
     <xsl:call-template name="processInner">
      <xsl:with-param name="pNodes" select=
      "$pNodes[position()>1]"/>
     </xsl:call-template>
    </xsl:when>
    <xsl:otherwise>
      <xsl:variable name="vbatchLength">
        <xsl:call-template name="getBatchLength">
         <xsl:with-param name="pNodes"
              select="$pNodes[position()>1]"/>
         <xsl:with-param name="pName"
             select="name($pNode1)"/>
         <xsl:with-param name="pCount" select="1"/>
        </xsl:call-template>
      </xsl:variable>

      <xsl:element name="{name($pNode1)}">
        <!-- copy the attributes of the first element in the batch, not of the context node -->
        <xsl:copy-of select="$pNode1/@*"/>

        <xsl:call-template name="processInner">
         <xsl:with-param name="pNodes" select=
         "$pNodes[not(position()>$vbatchLength)]
                        /node()"/>
        </xsl:call-template>
      </xsl:element>

      <xsl:call-template name="processInner">
       <xsl:with-param name="pNodes" select=
       "$pNodes[position()>$vbatchLength]"/>
      </xsl:call-template>
    </xsl:otherwise>
   </xsl:choose>
  </xsl:if>
 </xsl:template>

 <xsl:template name="getBatchLength">
  <xsl:param name="pNodes"/>
  <xsl:param name="pName"/>
  <xsl:param name="pCount"/>

  <xsl:choose>
   <xsl:when test=
   "not($pNodes) or not($pNodes[1]/self::*)
    or not(name($pNodes[1])=$pName)">
   <xsl:value-of select="$pCount"/>
   </xsl:when>
   <xsl:otherwise>
    <xsl:call-template name="getBatchLength">
     <xsl:with-param name="pNodes" select=
         "$pNodes[position()>1]"/>
     <xsl:with-param name="pName" select="$pName"/>
     <xsl:with-param name="pCount" select="$pCount+1"/>
    </xsl:call-template>
   </xsl:otherwise>
  </xsl:choose>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the following XML document (based on the one provided, but made more complicated to show that more edge cases are covered):

<w:p xmlns:w="w">
    <w:r>
        <w:rPr>
            <w:b/>
        </w:rPr>
        <w:t xml:space="preserve">This is a </w:t>
    </w:r>
    <w:r>
        <w:rPr>
            <w:b/>
        </w:rPr>
        <w:t xml:space="preserve">bold </w:t>
    </w:r>
    <w:r>
        <w:rPr>
            <w:b/>
            <w:i/>
        </w:rPr>
        <w:t>with a bit of italic</w:t>
    </w:r>
    <w:r>
        <w:rPr>
            <w:b/>
            <w:i/>
        </w:rPr>
        <w:t> and some more italic</w:t>
    </w:r>
    <w:r>
        <w:rPr>
            <w:i/>
        </w:rPr>
        <w:t> and just italic, no-bold</w:t>
    </w:r>
    <w:r>
        <w:rPr>
            <w:b/>
        </w:rPr>
        <w:t xml:space="preserve"></w:t>
    </w:r>
    <w:r>
        <w:rPr>
            <w:b/>
        </w:rPr>
        <w:t>paragr</w:t>
    </w:r>
    <w:r>
        <w:rPr>
            <w:b/>
        </w:rPr>
        <w:t>a</w:t>
    </w:r>
    <w:r>
        <w:rPr>
            <w:b/>
        </w:rPr>
        <w:t>ph</w:t>
    </w:r>
    <w:r>
        <w:t xml:space="preserve"> with some non-bold in it too.</w:t>
    </w:r>
</w:p>

the wanted, correct result is produced:

<p><b>This is a bold <i>with a bit of italic and some more italic</i></b><i> and just italic, no-bold</i><b>paragraph</b> with some non-bold in it too.</p>

Explanation:

  1. This is a two-pass transformation. The first pass is relatively simple and converts the source XML document (in our specific case) to the following:

pass1 result (indented for readability):

<p>
   <b>This is a </b>
   <b>bold </b>
   <b>
      <i>with a bit of italic</i>
   </b>
   <b>
      <i> and some more italic</i>
   </b>
   <i> and just italic, no-bold</i>
   <b/>
   <b>paragr</b>
   <b>a</b>
   <b>ph</b> with some non-bold in it too.</p>

  2. The second pass (executed in mode "pass2") merges any batch of consecutive, identically named elements into a single element with that name. It calls itself recursively on the children of the merged elements, so batches at any depth are merged.

  3. Do note: we do not (and cannot) use the axes following-sibling:: or preceding-sibling::, because only the nodes to be merged at the top level are really siblings. For this reason we process all nodes simply as a node-set.

  4. This solution is completely generic: it merges any batch of consecutive, identically named elements at any depth, and no specific names are hardcoded.
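
The merging logic of pass 2 can also be read as a small recursive function. What follows is only an illustrative sketch in Python, not part of the stylesheet: the tuple-based node model and the names `merge_adjacent` and `serialize` are assumptions made up for this example, and the serializer does no escaping.

```python
from itertools import groupby

# Hypothetical node model for this sketch:
# a node is either a text string or a (name, children) tuple.


def node_name(node):
    """Return the element name, or None for a text node."""
    return node[0] if isinstance(node, tuple) else None


def merge_adjacent(nodes):
    """Merge every batch of consecutive same-named elements, at any depth."""
    result = []
    for name, batch in groupby(nodes, key=node_name):
        batch = list(batch)
        if name is None:
            # Text nodes are copied through unchanged.
            result.extend(batch)
        else:
            # Pool the children of the whole batch, then recurse on them,
            # so that batches nested inside the merged element merge too.
            children = [child for _, kids in batch for child in kids]
            result.append((name, merge_adjacent(children)))
    return result


def serialize(nodes):
    """Render the node list as markup (no escaping; illustration only)."""
    parts = []
    for node in nodes:
        if isinstance(node, tuple):
            name, kids = node
            parts.append(f"<{name}>{serialize(kids)}</{name}>")
        else:
            parts.append(node)
    return "".join(parts)


# The pass1 result shown above, expressed in the sketch's node model.
pass1 = [
    ("b", ["This is a "]),
    ("b", ["bold "]),
    ("b", [("i", ["with a bit of italic"])]),
    ("b", [("i", [" and some more italic"])]),
    ("i", [" and just italic, no-bold"]),
    ("b", []),
    ("b", ["paragr"]),
    ("b", ["a"]),
    ("b", ["ph"]),
    " with some non-bold in it too.",
]

print(serialize(merge_adjacent(pass1)))
```

Pooling the children of a batch before recursing plays the same role as selecting `$pNodes[not(position()>$vbatchLength)]/node()` in the stylesheet: the pooled nodes are handled as one list even though they were not siblings in the pass1 tree.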

Dimitre Novatchev
  • +1. I was sure you were able to accomplish this :)). An example of how much simpler things would be in XSLT 2.0 (would they be?) would be much appreciated, if you have the time and inclination. Thanks – Emiliano Poggi Jun 16 '11 at 05:56
  • @empo: Thanks. I don't see how an XSLT 2.0 solution for this problem could be radically simpler. Besides not having to use `xxx:node-set()` (only once!) between the two passes, there isn't anything more that can be simplified. Maybe using `group-adjacent` ... but this isn't too big a simplification. Not to mention that @Jacqueline doesn't want an XSLT 2.0 solution. – Dimitre Novatchev Jun 16 '11 at 12:44
  • +1 - I wish I could upvote this more!!! Thank you so much. It works perfectly. You have definitely earned your bounty. – Jacqueline Jun 16 '11 at 14:37
  • @Jacqueline: You are welcome. Thanks for this excellent question -- last night I felt happy when my code worked. I love exactly this kind of difficult, seemingly almost impossible problem. – Dimitre Novatchev Jun 16 '11 at 15:59
  • @DimitreNovatchev this is brilliant! You don't happen to have a stylesheet to make the reverse happen? I.e. map HTML markups into OOXML? – silentsurfer Feb 25 '15 at 22:21
  • @silentsurfer, this is not a 1:1 mapping, therefore no "reverse" mapping exists. If somebody were to specify such a "reverse" mapping strictly, then, using an HTML parser that produces an object model (such as TagSoup, or SgmlReader), one could write a transformation that takes as input the tree produced by the HTML parser, and transforms it to the corresponding OOXML. Even simpler is to open the HTML with Word, and to save this as XML :) – Dimitre Novatchev Feb 26 '15 at 03:28

This isn't really a complete solution, but it's far simpler than trying to do it with pure XSLT. Depending on the complexity of your source it might not be ideal either, but it might be worth a try. These templates:

<xsl:template match="w:p">
  <p>
    <xsl:apply-templates />
  </p>
</xsl:template>

<xsl:template match="w:r[w:rPr/w:b]">
  <b>
    <xsl:apply-templates />
  </b>
</xsl:template>

<xsl:template match="w:r[w:rPr/w:i]">
  <i>
    <xsl:apply-templates />
  </i>
</xsl:template>

<xsl:template match="w:r[w:rPr/w:i and w:rPr/w:b]">
  <b>
    <i>
      <xsl:apply-templates />
    </i>
  </b>
</xsl:template>

Will output <p><b>This is a </b><b>bold </b><b><i>with a bit of italic</i></b><b> </b><b>paragr</b><b>a</b><b>ph</b> with some non-bold in it too.</p>

You can then use simple text manipulation to remove any occurrences of </b><b>, and </i><i>, leaving you with:

<p><b>This is a bold <i>with a bit of italic</i> paragraph</b> with some non-bold in it too.</p>
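
The text-manipulation step can be done in whatever language drives the pipeline. As a minimal sketch in Python (the function name `collapse_adjacent_tags` is made up for this example, and it assumes the tags carry no attributes and that nothing, not even whitespace, sits between a closing tag and the next opening tag):

```python
def collapse_adjacent_tags(html, tags=("b", "i")):
    """Remove '</x><x>' seams so adjacent same-named elements become one.

    Repeats until nothing changes, because removing one seam
    (e.g. '</i><i>') can expose another (e.g. '</b><b>').
    """
    previous = None
    while previous != html:
        previous = html
        for tag in tags:
            html = html.replace(f"</{tag}><{tag}>", "")
    return html


# The raw output of the templates above.
raw = (
    "<p><b>This is a </b><b>bold </b><b><i>with a bit of italic</i></b>"
    "<b> </b><b>paragr</b><b>a</b><b>ph</b> with some non-bold in it too.</p>"
)

print(collapse_adjacent_tags(raw))
```

Because this works on the serialized string rather than on the tree, it is only as reliable as those assumptions about the markup.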

Flynn1179
  • +1 for the good starting point. Two observations: the templates create (resolvable) ambiguities (fixed in my answer); and from the point of view of HTML, having such strange output (nested bolds and the like) is irrelevant and acceptable, even if a bit dirty. – Emiliano Poggi Jun 10 '11 at 12:48
  • "You can then use simple text manipulation to remove any occurrences of `</b><b>`, and `</i><i>`"...perhaps this is a task for HTML tidy. – Emiliano Poggi Jun 10 '11 at 13:16
  • Perhaps, but I'd have thought HTML tidy's probably a bit overkill for this purpose. There shouldn't be any ambiguities though, the XSLT spec's pretty clear about how these templates should be overridden. – Flynn1179 Jun 10 '11 at 13:42
  • There is, even if resolvable. Try to compile your code with Saxon for example. – Emiliano Poggi Jun 10 '11 at 13:44
  • Well, it's fairly easy to make it unambiguous by replacing `match="w:r[w:rPr/w:b]"` with `match="w:r[w:rPr/w:b and not(w:rPr/w:i)]"` for example, but it shouldn't really be a problem. – Flynn1179 Jun 10 '11 at 16:02
  • Thanks for your effort. I'm looking for a solution in which I don't need to have another process run, but perhaps this isn't possible. – Jacqueline Jun 10 '11 at 17:10
  • @Flynn: a cleaner way to make it unambiguous would be to put `priority="2"` on the template that matches `w:r[w:rPr/w:i and w:rPr/w:b]`. (http://www.w3.org/TR/xslt#conflict) (Also, +1 for a good starting point.) – LarsH Jun 10 '11 at 20:22

OOXML is a defined standard with its own specification. To create a general transform from OOXML to HTML (an interesting task, though I think there should already be existing implementations around the web), you should study at least a bit of the standard (and, I think, a bit of XSLT).

Generally (very generally), the content of a WordML document consists mainly of w:p (paragraph) elements containing w:r runs (regions of text with the same properties). Inside each run, you can normally find the text properties of the region (w:rPr) and the text itself (w:t).
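
To make that structure concrete, here is a small sketch in Python (not XSLT) that walks one paragraph and lists each run's properties and text. It uses only the standard library's `xml.etree.ElementTree`; the function name `list_runs` and the shortened sample are assumptions made for this illustration, and the namespace URI is the one used in the stylesheet in this answer.

```python
import xml.etree.ElementTree as ET

# Namespace URI as used by the stylesheet in this answer.
NS = {"w": "http://schemas.microsoft.com/office/word/2003/wordml"}

# Shortened, hypothetical sample with three runs.
SAMPLE = """<w:p xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:r><w:rPr><w:b/></w:rPr><w:t xml:space="preserve">This is a </w:t></w:r>
  <w:r><w:rPr><w:b/><w:i/></w:rPr><w:t>with a bit of italic</w:t></w:r>
  <w:r><w:t xml:space="preserve"> with some non-bold in it too.</w:t></w:r>
</w:p>"""


def list_runs(paragraph):
    """Return a (property names, text) pair for every w:r in a w:p."""
    runs = []
    for run in paragraph.findall("w:r", NS):
        # Tags look like '{namespace-uri}b'; keep only the local name.
        props = [prop.tag.split("}")[1] for prop in run.findall("w:rPr/*", NS)]
        text = "".join(t.text or "" for t in run.findall("w:t", NS))
        runs.append((props, text))
    return runs


for props, text in list_runs(ET.fromstring(SAMPLE)):
    print(props, repr(text))
```

Each run carries its own complete property list, which is why identical formatting ends up repeated from run to run.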

The model is much more intricate, but you can start working on this general structure.

For instance, you can start with the following (somewhat) general transform. Note that it handles only paragraphs with bold, italic and underlined text.


XSLT 2.0 tested under Saxon-HE 9.2.1.1J

<xsl:stylesheet version="2.0" 
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    exclude-result-prefixes="w">
    <xsl:output method="html"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="w:document/w:body">
        <html>
            <body>
                <xsl:apply-templates select="w:p"/>
            </body>
        </html>
    </xsl:template>

    <!-- match paragraph -->
    <xsl:template match="w:p">
        <p>
            <xsl:apply-templates select="w:r"/>
        </p>
    </xsl:template>

    <!-- match run with property -->
    <xsl:template match="w:r[w:rPr]">
        <xsl:apply-templates select="w:rPr/*[1]"/>
    </xsl:template>

    <!-- Recursive template for bold, italic and underline
    properties applied to the same run. Escape to paragraph
    text -->
    <xsl:template match="w:b | w:i | w:u">
        <xsl:element name="{local-name(.)}">
            <xsl:choose>
                <!-- recurse to the next sibling property i, b or u only,
                     so that each property is nested exactly once -->
                <xsl:when test="following-sibling::*[self::w:b or self::w:i or self::w:u]">
                    <xsl:apply-templates select="following-sibling::*
                        [self::w:b or self::w:i or self::w:u][1]"/>
                </xsl:when>
                <xsl:otherwise>
                    <!-- escape to text -->
                    <xsl:apply-templates select="parent::w:rPr/
                        following-sibling::w:t"/>
                </xsl:otherwise>
            </xsl:choose>
        </xsl:element>
    </xsl:template>

    <!-- match run without property -->
    <xsl:template match="w:r[not(w:rPr)]">
        <xsl:apply-templates select="w:t"/>
    </xsl:template>

    <!-- match text -->
    <xsl:template match="w:t">
        <xsl:value-of select="."/>
    </xsl:template>

</xsl:stylesheet>

Applied on:

<w:document xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
    <w:body>
        <w:p>
            <w:r>
                <w:rPr>
                    <w:b/>
                </w:rPr>
                <w:t xml:space="preserve">This is a </w:t>
            </w:r>
            <w:r>
                <w:rPr>
                    <w:b/>
                </w:rPr>
                <w:t xml:space="preserve">bold </w:t>
            </w:r>
            <w:r>
                <w:rPr>
                    <w:b/>
                    <w:i/>
                </w:rPr>
                <w:t>with a bit of italic</w:t>
            </w:r>
            <w:r>
                <w:rPr>
                    <w:b/>
                </w:rPr>
                <w:t xml:space="preserve"> </w:t>
            </w:r>
            <w:r>
                <w:rPr>
                    <w:b/>
                </w:rPr>
                <w:t>paragr</w:t>
            </w:r>
            <w:r>
                <w:rPr>
                    <w:b/>
                </w:rPr>
                <w:t>a</w:t>
            </w:r>
            <w:r>
                <w:rPr>
                    <w:b/>
                </w:rPr>
                <w:t>ph</w:t>
            </w:r>
            <w:r>
                <w:t xml:space="preserve"> with some non-bold in it too.</w:t>
            </w:r>
        </w:p>
    </w:body>
</w:document>

produces:

<html>
   <body>
      <p><b>This is a </b><b>bold </b><b><i>with a bit of italic</i></b><b> </b><b>paragr</b><b>a</b><b>ph</b> with some non-bold in it too.
      </p>
   </body>
</html>

The side effect of having grotesque HTML code is unavoidable, due to the underlying WordML schema. Perhaps the task of making the final HTML more legible could be deferred to some user-friendly (and powerful) utility like HTML tidy.

Emiliano Poggi
  • Thanks for all of the work, but I already had this. My issue and question were regarding generating output in which the formatting tags are combined. – Jacqueline Jun 10 '11 at 16:47
  • Well that's what the answer provides I think. Can you be more precise? – Emiliano Poggi Jun 10 '11 at 16:49
  • She wants multiple contiguous spans of `<b>`, for example, to be collapsed into a single `<b>`. – LarsH Jun 10 '11 at 17:02
  • @Jacqueline: I've seen your edit now, and I've only just realized your intent. Perhaps you could have shown us a bit of your transform. I hope you accept @LarsH's answer and upvote, responsibly, those who helped you. Cheers – Emiliano Poggi Jun 10 '11 at 20:55
  • Yes, sorry for that mistake. I didn't see it until too late. I definitely appreciate the help. :) – Jacqueline Jun 10 '11 at 21:14

Another approach, similar to Flynn's but staying with XSLT instead of adding a separate text processing layer, would be to transform the initial HTML output in the same stylesheet to collapse the adjacent elements of <b> or <i> into single elements.

In other words, the stylesheet would first generate the initial HTML result tree, then pass that as input to a set of templates (using a special mode) that performed the collapsing operation.

Updated: Here is a working, 2-stage stylesheet, built on @empo's stage-1 stylesheet:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs w"
   xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml" version="2.0">

   <xsl:output method="html"/>
   <xsl:strip-space elements="*"/>
   <xsl:variable name="collapsibles" select="('i', 'b', 'u')"/>      

   <!-- identity template, except we collapse any adjacent b or i child elements. -->
   <xsl:template match="*" mode="collapse-adjacent">
      <xsl:copy>
         <xsl:copy-of select="@*"/>
         <xsl:for-each select="node()">
            <xsl:choose>
               <xsl:when test="index-of($collapsibles, local-name()) and
                     not(name(preceding-sibling::node()[1]) = name())">
                  <xsl:copy>
                     <xsl:copy-of select="@*"/>
                     <xsl:call-template name="process-niblings"/>
                  </xsl:copy>
               </xsl:when>
               <xsl:when test="index-of($collapsibles, local-name())"/>
               <!-- do not copy -->
               <xsl:otherwise>
                  <xsl:copy>
                     <xsl:copy-of select="@*"/>
                     <xsl:apply-templates mode="collapse-adjacent"/>
                  </xsl:copy>
               </xsl:otherwise>
            </xsl:choose>
         </xsl:for-each>
      </xsl:copy>
   </xsl:template>

   <!-- apply templates to children of current element *and* of all
      consecutively following elements of the same name. -->
   <xsl:template name="process-niblings">
      <xsl:apply-templates mode="collapse-adjacent"/>
      <!-- If immediate following sibling is the same element type, recurse with
         context node set to that sibling. -->
      <xsl:for-each
         select="following-sibling::node()[1][name() = name(current())]">
         <xsl:call-template name="process-niblings"/>
      </xsl:for-each>
   </xsl:template>

   <!-- @empo's stylesheet (modified) follows. --> 
   <xsl:template match="/">
      <html>
         <body>
            <xsl:variable name="raw-html">
               <xsl:apply-templates />
            </xsl:variable>
            <xsl:apply-templates select="$raw-html" mode="collapse-adjacent"/>            
         </body>
      </html>
   </xsl:template>

   <xsl:template match="w:document | w:body">
      <xsl:apply-templates />
   </xsl:template>

   <!-- match paragraph -->
   <xsl:template match="w:p">
      <p>
         <xsl:apply-templates select="w:r"/>
      </p>
   </xsl:template>

   <!-- match run with property -->
   <xsl:template match="w:r[w:rPr]">
      <xsl:apply-templates select="w:rPr/*[1]"/>
   </xsl:template>

   <!-- Recursive template for bold, italic and underline
      properties applied to the same run. Escape to paragraph
      text -->
   <xsl:template match="w:b | w:i | w:u">
      <xsl:element name="{local-name(.)}">
         <xsl:choose>
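            <!-- recurse to the next sibling property i, b or u only.
                 A general comparison is used instead of a bare index-of()
                 predicate, which XPath 2.0 would treat as a position test. -->
            <xsl:when test="following-sibling::*[local-name(.) = $collapsibles]">
               <xsl:apply-templates select="following-sibling::*
                  [local-name(.) = $collapsibles][1]"/>
            </xsl:when>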
            <xsl:otherwise>
               <!-- escape to text -->
               <xsl:apply-templates select="parent::w:rPr/
                  following-sibling::w:t"/>
            </xsl:otherwise>
         </xsl:choose>
      </xsl:element>
   </xsl:template>

   <!-- match run without property -->
   <xsl:template match="w:r[not(w:rPr)]">
      <xsl:apply-templates select="w:t"/>
   </xsl:template>

   <!-- match text -->
   <xsl:template match="w:t">
      <xsl:value-of select="."/>
   </xsl:template>

</xsl:stylesheet>

When tested against the sample input you gave, the above stylesheet yields

<html>
   <body>
      <p><b>This is a bold <i>with a bit of italic</i> paragraph</b> with some non-bold in it too.
      </p>
   </body>
</html>

which looks like what you wanted.

LarsH
  • @LarsH - this is looking like my best option so far - I prefer to stick with XSLT. – Jacqueline Jun 10 '11 at 17:52
  • @LarsH: thanks for introducing me to "two-stages" stylesheet processing; even if I need time to fully get it. Also happy you extended my transform. :) – Emiliano Poggi Jun 10 '11 at 20:13
  • @empo: You're welcome. Two-stage processing is just function composition: `f(g(x))`. It was unfortunately awkward in XSLT 1.0 but is easy in 2.0. – LarsH Jun 10 '11 at 20:16
  • @Jacqueline: completed the stylesheet. Don't know why I called you Jessica earlier. :-S – LarsH Jun 10 '11 at 20:17
  • @empo: just be glad I didn't call her Jezebel or something. :-) – LarsH Jun 10 '11 at 20:26
  • @Jacqueline: this solution has its limitations... e.g. it won't combine `my truck` into `my truck`, nor `my truck` into `my truck`. But hopefully it's good enough. As @empo pointed out, a browser won't care about the elegance of the tags. – LarsH Jun 10 '11 at 20:29
  • @LarsH - thanks so much. I know the browser won't care about the elegance of the tags, but unfortunately the comparison we do between the pre- and post- OOXML will. The 'my truck' situation will not be a problem, but the 'my truck' one will be. I will have to mull this over this weekend... with a glass of wine :) – Jacqueline Jun 10 '11 at 21:27