I have a large XML dataset that needs to be parsed and converted to CSV. One of the elements in the XML is a procedure: a series of steps. The steps were originally entered in a formatted screen where RTF coding allowed for bulleted lists, font changes, and so on. When the data was exported from the database into my source XML, these formatted instructions came through as raw RTF codes in the XML, like this:
<SPECORMETHOD>{\rtf1\ansi\deff0\uc1\ansicpg1252\deftab720{\fonttbl{\f0\fnil\fcharset1 Arial;}{\f1\fnil\fcharset1 Garamond;}{\f2\fnil\fcharset0 Garamond;}{\f3\fnil\fcharset1 WingDings;}}{\colortbl\red0\green0\blue0;\red255\green0\blue0;\red0\green128\blue0;\red0\green0\blue255;\red255\green255\blue0;\red255\green0\blue255;\red128\green0\blue128;\red128\green0\blue0;\red0\green255\blue0;\red0\green255\blue255;\red0\green128\blue128;\red0\green0\blue128;\red255\green255\blue255;\red192\green192\blue192;\red128\green128\blue128;\red0\green0\blue0;}\wpprheadfoot1\paperw12240\paperh15840\margl720\margr720\margt720\margb720\headery720\footery720\endnhere\sectdefaultcl{\*\generator WPTools_5.17;}{\*\listtable{\list\listtemplateid1\listsimple{\listlevel\leveljc0\levelfollow0\levelstartat1\levelspace0\levelindent360\levelnfc0{\leveltext\'02\'00.;}{\levelnumbers\'01;}}\listid1}}{\*\listoverridetable{\listoverride\listid1\listoverridecount0\ls1}}{\ls1\ilvl0{\listtext 1.\tab}\li400\fi-400\plain\f2\fs26 Procedure Step 1.\par{\listtext\fs26 2.\tab}\plain\f2\fs26 Procedure Step 2.\par{\listtext\fs26 3.\tab}\plain\f2\fs26 Procedure Step 3.\par{\listtext\fs26 4.\tab}\plain\f2\fs26 Procedure Step 4.\par{\listtext\fs26 5.\tab}\plain\f2\fs26 Procedure Step 5.\par{\listtext\fs26 6.\tab}\plain\f2\fs26 Procedure Step 6.\par\pard\plain\plain\f2\fs26\par\plain\f2\fs26 Entry dated 02-07-2023\par}}</SPECORMETHOD>
If I save this content as an RTF file, open it in a word processor, and save it as plain text, I end up with the desired result:
1. Procedure Step 1.
2. Procedure Step 2.
3. Procedure Step 3.
4. Procedure Step 4.
5. Procedure Step 5.
6. Procedure Step 6.
Entry dated 02-07-2023
However, I would prefer to do this dynamically in the XSLT flow, since there are tens of thousands of instances of procedures within the XML structure. If I separate them into files, I'd have to re-link them back into their correct position in the XML with extra steps (which is fine if I need to but seems inefficient).
I've tried:
- Intensive pattern matching in XSLT using regular expressions. This gets me part of the way there, but variations between authors and in formatting make it time-consuming and difficult.
- The Java Swing RTFEditorKit. I have not done any Java/XSLT integration before. I followed some examples from other questions, but I receive "Reflexive calls to Java methods are not available under Saxon-HE", which indicates I need the PE version. Getting Saxon-PE is not a problem if this approach works, but I am unsure whether it does, so I am looking for experience with it.
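For reference, the RTFEditorKit route I have in mind would look roughly like the standalone helper below. This is only a sketch: the class name `RtfToText` and the method `toPlainText` are placeholder names of my own, and I have not confirmed how the kit renders the `\listtext` groups that WPTools writes, so the numbering in the output may differ from what a word processor produces. Wiring it into the stylesheet is the part I have not worked out.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;

// Hypothetical helper (names are placeholders, not part of any library).
final class RtfToText {
    private RtfToText() {}

    // Parses an RTF string with Swing's RTFEditorKit and returns the
    // document text with the control words and groups stripped.
    static String toPlainText(String rtf) {
        RTFEditorKit kit = new RTFEditorKit();
        // The kit's default document is a styled document, which read() requires.
        Document doc = kit.createDefaultDocument();
        try {
            kit.read(new StringReader(rtf), doc, 0);
            return doc.getText(0, doc.getLength());
        } catch (IOException | BadLocationException e) {
            // Wrapped as unchecked so a caller does not have to declare them.
            throw new IllegalArgumentException("Could not parse RTF content", e);
        }
    }
}
```

The idea would be to pass the string value of each `SPECORMETHOD` element to `RtfToText.toPlainText(...)` and write the returned text into the CSV column.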
I'm using XML 1.1, XSLT 2.0 via saxon-he-11.3 on Java 17.0.4.1, all through Eclipse IDE 2022-12 (4.26.0).
Ultimately, I am looking for suggestions on how best to approach this mass conversion of RTF to text within an XML element during XSLT processing.
Thanks, Michael