I have a large XML dataset that needs to be parsed and converted to CSV. One of the elements in the XML is a procedure, a series of steps. The series of steps originated in a formatted screen where a lot of RTF coding allowed for bulleted lists, font differences, and so on. When exported from the database into my source XML, these formatted instructions became RTF codes in the xml, like this:

<SPECORMETHOD>{\rtf1\ansi\deff0\uc1\ansicpg1252\deftab720{\fonttbl{\f0\fnil\fcharset1 Arial;}{\f1\fnil\fcharset1 Garamond;}{\f2\fnil\fcharset0 Garamond;}{\f3\fnil\fcharset1 WingDings;}}{\colortbl\red0\green0\blue0;\red255\green0\blue0;\red0\green128\blue0;\red0\green0\blue255;\red255\green255\blue0;\red255\green0\blue255;\red128\green0\blue128;\red128\green0\blue0;\red0\green255\blue0;\red0\green255\blue255;\red0\green128\blue128;\red0\green0\blue128;\red255\green255\blue255;\red192\green192\blue192;\red128\green128\blue128;\red0\green0\blue0;}\wpprheadfoot1\paperw12240\paperh15840\margl720\margr720\margt720\margb720\headery720\footery720\endnhere\sectdefaultcl{\*\generator WPTools_5.17;}{\*\listtable{\list\listtemplateid1\listsimple{\listlevel\leveljc0\levelfollow0\levelstartat1\levelspace0\levelindent360\levelnfc0{\leveltext\'02\'00.;}{\levelnumbers\'01;}}\listid1}}{\*\listoverridetable{\listoverride\listid1\listoverridecount0\ls1}}{\ls1\ilvl0{\listtext 1.\tab}\li400\fi-400\plain\f2\fs26 Procedure Step 1.\par{\listtext\fs26 2.\tab}\plain\f2\fs26 Procedure Step 2.\par{\listtext\fs26 3.\tab}\plain\f2\fs26 Procedure Step 3.\par{\listtext\fs26 4.\tab}\plain\f2\fs26 Procedure Step 4.\par{\listtext\fs26 5.\tab}\plain\f2\fs26 Procedure Step 5.\par{\listtext\fs26 6.\tab}\plain\f2\fs26 Procedure Step 6.\par\pard\plain\plain\f2\fs26\par\plain\f2\fs26 Entry dated 02-07-2023\par}}</SPECORMETHOD>

If I save this content as RTF and open it in any word-like program and save it as text, I end up with the desired results:

1. Procedure Step 1.
2. Procedure Step 2.
3. Procedure Step 3. 
4. Procedure Step 4.
5. Procedure Step 5.
6. Procedure Step 6.
Entry dated 02-07-2023

However, I would prefer to do this dynamically in the XSLT flow, since there are tens of thousands of instances of procedures within the XML structure. If I separate them into files, I'd have to re-link them back into their correct position in the XML with extra steps (which is fine if I need to but seems inefficient).

I've tried:

  1. Intensive pattern matching in XSLT using regular expressions. This gets me part of the way there, but variations between authors and in formatting make it time-consuming and difficult.
  2. The Java Swing RTFEditorKit, though I have not done any Java/XSLT integration before. I followed some examples from other questions, but I get "Reflexive calls to Java methods are not available under Saxon-HE", indicating that I need the PE edition. Getting Saxon-PE is not a problem if this approach works, but I am unsure whether it does, so I am looking for experience with it.
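
For reference, the RTFEditorKit route from item 2 can be tried outside XSLT first. The following is a minimal sketch, assuming the JDK's `javax.swing.text.rtf.RTFEditorKit` copes with the RTF in question. The class name `RtfToText` and the method `toPlainText` are made up for illustration, and whether list numbers such as "1." survive the conversion has not been verified here.

```java
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical helper: reads an RTF string into a Swing document model
// and returns only the text content of that model.
class RtfToText {
    static String toPlainText(String rtf) throws IOException, BadLocationException {
        RTFEditorKit kit = new RTFEditorKit();
        Document doc = kit.createDefaultDocument();
        // Parse the RTF markup into the document, starting at offset 0.
        kit.read(new StringReader(rtf), doc, 0);
        // Control words are consumed by the parser; only the text remains.
        return doc.getText(0, doc.getLength());
    }
}
```

A method like this could later be exposed to the stylesheet, either reflexively under Saxon-PE or as an integrated extension function under Saxon-HE.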

I'm using XML 1.1, XSLT 2.0 via saxon-he-11.3 on Java 17.0.4.1, all through Eclipse IDE 2022-12 (4.26.0).

At the end of the day, I am looking for suggestions in how best to approach this mass conversion of RTF to text within an XML element during XSLT processing.

Thanks, Michael

  • If you use Saxon 11 you are using XSLT 3.0, as Saxon has been an XSLT 3 processor since version 9.8. As for your question about calling into Java, you can do that with HE too if you are willing to write integrated extension functions, documented at https://www.saxonica.com/html/documentation11/extensibility/extension-functions-J/. Reflexive calls do indeed work only with PE or EE. I have no idea, however, how easily or how well that works with a Swing component. – Martin Honnen Feb 07 '23 at 19:58
  • As for processing RTF input, I would hope someone has done that in some library you could use, but unfortunately my googling fails because RTF is also an abbreviation for "result tree fragment" from XSLT 1.0, so I mainly find articles about XSLT and result tree fragment limitations. There are some powerful transformation frameworks in XSLT 2/3 like transpect, but it seems it converts docx input (e.g. http://transpect.github.io/modules-docx2hub.html) but not RTF. – Martin Honnen Feb 07 '23 at 20:07
  • Thanks, @MartinHonnen; I was hoping to find a library as well. For expediency, I think I'll give -PE a try and see if I can use the existing RTFEditorKit to read and write the strings into variables in XSLT. As for XSLT 3, I've not spent as much time there, so I've kept the stylesheet at 2.0 even though I know saxon is 3. Maybe a good opportunity to learn 3! It seems to have a lot of very advanced capability. Always the balance between get things done and learn something new to get things done better. Always appreciate your insight, Martin. – Michael Friedman Feb 07 '23 at 20:25
  • A nice solution would be to do this using "invisible XML". The idea is simple: if you can write a BNF grammar for RTF (or find one that has already been written), then Invisible XML will automatically convert the RTF to an equivalent XML document which you can then transform directly using XSLT. – Michael Kay Feb 08 '23 at 00:11
  • @MichaelKay, thanks for the suggestion. I'll take a look at this interesting approach. I did, in the short term acquire Saxon PE and am working on configuring it in Eclipse (which is keeping me humble, as getting Eclipse to locate the license file is somehow a challenge). – Michael Friedman Feb 08 '23 at 23:03
  • I'm afraid I have no experience with Eclipse, but I've heard that configuring class paths is even more difficult than in IntelliJ, which is saying something. – Michael Kay Feb 09 '23 at 07:49
  • @MichaelKay; I was finally able to get the license to be recognized. For posterity: Eclipse 2022-12 (4.26.0). Eclipse > Settings > XML > XSL Java Processors. "Add" a Java Processor called Saxon PE - 12.0 (or similar), choose "Saxon (XSLT 2.0)" and add the saxon-pe-12.0.jar library as "external Jar". Next, Settings > Java > Build Path > Classpath Variables. Create a new Variable Entry, name = LICENSE_FILE_LOCATION, and set the folder to where the license is installed (e.g. /Library/SaxonPE-12-0J). [more...] – Michael Friedman Feb 10 '23 at 18:16
  • Part 2... Create a transformation scenario via "Run Configurations" (Green Play button in toolbar). Set main and output as desired. On the Processor tab, choose use specific processor and select the Saxon PE - 12.0 created earlier. Then in the Classpath tab, under User Entries, add the following external Jars (download if necessary): saxon-pe-12.0.jar, jdom-2.0.6.1.jar, dom4j-1.6.1.jar. Select User Entries, click "Advanced", Add Classpath Variables, then add the LICENSE_FILE_LOCATION variable. Save, Run and no more license warnings and the XSLT is processed. – Michael Friedman Feb 10 '23 at 18:20
  • Final note: There are probably more elegant ways to do this, but I have deadlines at the moment. When I added simply the saxon pe jar, I received errors about No Class Definition on jdom, then after adding that, dom4j. Adding that seems to have finalized everything. Apologies for the monstrosity. Incentive to learn better ways of learning how to do this stuff. In my use case, I am only running transformations on Java projects so did not need to add these things in the project itself. Once I realized Eclipse's run configurations is different than the project, this came together. – Michael Friedman Feb 10 '23 at 18:22

1 Answer

I found Apache Tika, which can convert RTF to XHTML (https://tika.apache.org/2.7.0/examples.html#Parsing_to_XHTML), and managed to wrap it in an integrated extension function for Saxon 11 HE. The function takes the RTF string as input and returns an XdmNode, so in XSLT/XPath you can process the result like any other input tree:

package org.example;

import net.sf.saxon.s9api.*;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

import javax.xml.transform.stream.StreamSource;
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;

public class Main {
    public static void main(String[] args) throws SaxonApiException {
        // false = no licensed features required, so this runs on Saxon-HE
        Processor processor = new Processor(false);

        // Integrated extension function: tika:parse-rtf($rtf as xs:string) as node()
        processor.registerExtensionFunction(new ExtensionFunction() {
            @Override
            public QName getName() {
                return new QName("http://example.com/mf/tika", "parse-rtf");
            }

            @Override
            public SequenceType getResultType() {
                return SequenceType.makeSequenceType(
                        ItemType.ANY_NODE, OccurrenceIndicator.ONE
                );
            }

            @Override
            public SequenceType[] getArgumentTypes() {
                return new SequenceType[]{
                        SequenceType.makeSequenceType(
                                ItemType.STRING, OccurrenceIndicator.ONE)};
            }

            @Override
            public XdmValue call(XdmValue[] arguments) throws SaxonApiException {
                try {
                    return parseRtfToHTML(arguments[0].itemAt(0).getStringValue(), processor);
                } catch (IOException | URISyntaxException | SAXException | TikaException e) {
                    // Report parsing problems to Saxon as a dynamic error
                    throw new SaxonApiException(e);
                }
            }
        });

        XsltCompiler xsltCompiler = processor.newXsltCompiler();

        Xslt30Transformer xslt30Transformer = xsltCompiler.compile(new StreamSource(new File("sheet1.xsl"))).load30();

        XdmValue result = xslt30Transformer.applyTemplates(new StreamSource(new File("sample1.xml")));

        System.out.println(result);
    }

    // Lets Tika convert the RTF string to XHTML markup, then parses that markup into a Saxon tree
    public static XdmNode parseRtfToHTML(String rtf, Processor processor)
            throws IOException, SAXException, TikaException, URISyntaxException, SaxonApiException {
        DocumentBuilder docBuilder = processor.newDocumentBuilder();
        docBuilder.setBaseURI(new URI("urn:from-string"));

        // Collects the SAX events emitted by Tika as serialized XML
        ContentHandler handler = new ToXMLContentHandler();

        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        try (InputStream stream = new ByteArrayInputStream(rtf.getBytes(StandardCharsets.UTF_8))) {
            parser.parse(stream, handler, metadata);
            return docBuilder.build(new StreamSource(new StringReader(handler.toString())));
        }
    }
}

POM dependencies:

<dependencies>
    <dependency>
        <groupId>net.sf.saxon</groupId>
        <artifactId>Saxon-HE</artifactId>
        <version>11.4</version>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>2.7.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers-standard-package</artifactId>
        <version>2.7.0</version>
    </dependency>
</dependencies>

With a sample like the one in your question and a stylesheet as follows

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="3.0"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                xmlns:tika="http://example.com/mf/tika"
                exclude-result-prefixes="#all"
                expand-text="yes">

    <xsl:template match="SPECORMETHOD">
        <rtf-as-xhtml>
            <xsl:sequence select="tika:parse-rtf(.)"/>
        </rtf-as-xhtml>
    </xsl:template>

    <xsl:mode on-no-match="shallow-copy"/>

    <xsl:output indent="yes"/>

    <xsl:template match="/" name="xsl:initial-template">
        <xsl:next-match/>
        <xsl:comment>Run with {system-property('xsl:product-name')} {system-property('xsl:product-version')} {system-property('Q{http://saxon.sf.net/}platform')}</xsl:comment>
    </xsl:template>

</xsl:stylesheet>

the output is e.g.

<rtf-as-xhtml><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.microsoft.rtf.RTFParser"/>
<meta name="Content-Type" content="application/rtf"/>
<title/>
</head>
<body><p>Procedure Step 1.</p>
<p>Procedure Step 2.</p>
<p>Procedure Step 3.</p>
<p>Procedure Step 4.</p>
<p>Procedure Step 5.</p>
<p>Procedure Step 6.</p>
<p/>
<p>Entry dated 02-07-2023</p>
<p/>
</body></html></rtf-as-xhtml>
<!--Run with SAXON HE 11.4 -->

In this simple demo I have made no effort to further process the XHTML returned by Tika from the integrated extension function, but of course you can use the full set of XSLT 3.0/XPath 3.1 features in Saxon 11 to select from it or transform it further.
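
As one illustration of that further processing, the paragraphs in the XHTML can be flattened to plain text lines ready for a CSV field. The sketch below does this in plain Java with the JDK's DOM parser rather than in XSLT, purely so that it is self-contained. `XhtmlParagraphs` and `toPlainText` are hypothetical names, and skipping empty `<p/>` elements is an assumption about the desired output.

```java
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

import javax.xml.parsers.DocumentBuilderFactory;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: joins the text of all non-empty XHTML <p> elements
// with newlines, dropping the surrounding markup.
class XhtmlParagraphs {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    static String toPlainText(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Tika's output is in the XHTML namespace, so namespace awareness is required.
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        NodeList paragraphs = doc.getElementsByTagNameNS(XHTML_NS, "p");
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < paragraphs.getLength(); i++) {
            String text = paragraphs.item(i).getTextContent().trim();
            if (!text.isEmpty()) {   // assumption: empty <p/> elements are not wanted
                lines.add(text);
            }
        }
        return String.join("\n", lines);
    }
}
```

In XSLT itself the same idea would amount to selecting the `p` elements in the XHTML namespace and joining their string values.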

Martin Honnen
  • Fantastic solution - and thanks for the extra work. The result output is much more handleable than the extensive RTF formatting. Thank you! – Michael Friedman Feb 08 '23 at 23:06
  • embarrassed to ask, but I've been working to implement your solution in Eclipse and can't seem to get past `org.eclipse.wst.xsl.jaxp.debug.invoker.internal.JAXPSAXProcessorInvoker Cannot find a 1-argument function named Q{http://example.com/mf/tika}parse-rtf();`. Can you explain how you run your Maven project? I am fairly sure this is Eclipse's bizarre configuration and my unmentored discovery process in an unfamiliar area. – Michael Friedman Feb 12 '23 at 22:16
  • I have the sample project online (but created with another IDE) https://github.com/martin-honnen/SaxonTikaRtfTest1, if that helps. `jaxp.debug.invoker.internal.JAXPSAXProcessorInvoker` doesn't look like an attempt to use Saxon and its s9api to me. – Martin Honnen Feb 12 '23 at 22:34
  • Thanks for that; this actually helped. I learned in Eclipse to Run > Java Project, then select the Main class, and it's all working! I have encountered one data issue so far. In one of these `SPECORMETHOD` elements, one of the RTF codes is a hyperlink pointing to an external document using a URI. During transformation, an error occurs as the hyperlink is turned into an `<a>` element in the XHTML: `Error reported by XML parser: The element type "a" must be terminated by the matching end-tag "</a>".: The element type "a" must be terminated by the matching end-tag "</a>".` Next comment will have the source. – Michael Friedman Feb 13 '23 at 18:28
  • Fragment of source: `Caesar \f1\b\i DIP\f1\i0 : {\field{\*\fldinst{HYPERLINK "..\\\\..\\\\SAUCES\\\\Dips\\\\Dip, Caesar.doc"}}{\*\fldtitle{..\\\\..\\\\SAUCES\\\\Dips\\\\Dip, Caesar.doc}}{\fldrslt{\f1\cf3\cs103\ul\cs3 Dip, Caesar.doc\plain\f1\fs28\b}}}\par\plain\f1\fs28\tab` – Michael Friedman Feb 13 '23 at 18:29
  • As functional URLs these are less important to me, so I'm going to intercept the text before sending it to Tika and use regex to render the hyperlink as text. Good reference material: https://stackoverflow.com/questions/2850575/what-is-the-rtf-syntax-for-a-hyperlink – Michael Friedman Feb 13 '23 at 18:50
  • Can you raise that as a separate question? I struggle to read and decipher any RTF, and having to do it in comments here on StackOverflow doesn't seem necessary. – Martin Honnen Feb 13 '23 at 18:50
  • For sure; I'll start a new related question. – Michael Friedman Feb 13 '23 at 19:00
  • Follow-up: (https://stackoverflow.com/questions/75440415/how-would-i-handle-rtf-hyperlinks-using-apache-tika-in-xslt) – Michael Friedman Feb 13 '23 at 19:29
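
The regex workaround described in the comments above might look roughly like the following sketch. It assumes hyperlink fields always have the shape shown in the fragment (a `\fldinst` group, an optional `\fldtitle` group, and a `\fldrslt` group with no nested braces). `RtfHyperlinks` and `flattenHyperlinks` are hypothetical names, and fields shaped differently are left untouched.

```java
import java.util.regex.Pattern;

// Hypothetical pre-processing step: replaces each RTF HYPERLINK field with
// just its visible result text before the RTF is handed to a parser.
class RtfHyperlinks {
    // Literal RTF pieces are quoted so only the bracketed classes act as regex.
    private static final Pattern HYPERLINK_FIELD = Pattern.compile(
            Pattern.quote("{\\field{\\*\\fldinst{HYPERLINK \"") + "[^\"]*" + Pattern.quote("\"}}")
            + "(?:" + Pattern.quote("{\\*\\fldtitle{") + "[^{}]*" + Pattern.quote("}}") + ")?"
            + Pattern.quote("{\\fldrslt{") + "([^{}]*)" + Pattern.quote("}}}"));

    static String flattenHyperlinks(String rtf) {
        // Keep the result text inside its own group so that formatting
        // control words in it stay scoped as they were in the field.
        return HYPERLINK_FIELD.matcher(rtf).replaceAll("{$1}");
    }
}
```

Called on the string value of `SPECORMETHOD` before the RTF reaches Tika, this would leave the link text in place as ordinary formatted text.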