12

I have a text file that looks like this:

XXX^YYYY^AAAAA^XXXXXX^AAAAAA....

Fields are separated by a caret (^). My assumptions are:

the first field = Name
the second field = Last name
the third field = Address

etc..

I would like to turn it into valid XML using XSLT, such as:

<name>XXX</name>
<l_name>YYYY</l_name>

I know it can be done easily with Perl, but I need to do it with XSLT, if possible.

vhu
snoofkin
  • Good question, +1. See my answer for a complete XSLT 1.0 solution and for a description of the more powerful text processing capabilities of XSLT 2.0 and a pointer to a real world XSLT 2.0 text processing example. – Dimitre Novatchev Apr 15 '11 at 13:29

2 Answers

13

Text (non-XML) files can be read with the standard XSLT 2.0 function unparsed-text().

Then one can use the standard XPath 2.0 function tokenize() and two other standard XPath 2.0 functions that accept a regular expression as one of their arguments -- matches() and replace().
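As a minimal sketch of how unparsed-text() and tokenize() might be combined for the caret-delimited file from the question (an illustration under stated assumptions, not a tested solution): the parameter name pFileUrl, its default value, the field names in vFieldNames, and the initial template name main are all hypothetical.

```xml
<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:xs="http://www.w3.org/2001/XMLSchema"
 exclude-result-prefixes="xs">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <!-- Assumption: URI of the text file, passed in by the caller -->
 <xsl:param name="pFileUrl" select="'input.txt'"/>

 <!-- Assumption: field names in positional order -->
 <xsl:variable name="vFieldNames" as="xs:string+"
  select="('name', 'l_name', 'address')"/>

 <!-- Named entry point, so no source XML document is required -->
 <xsl:template name="main">
  <results>
   <!-- Split the file into lines, skipping blank ones -->
   <xsl:for-each select=
    "tokenize(unparsed-text($pFileUrl), '\r?\n')[normalize-space()]">
    <record>
     <!-- The caret is a regex metacharacter, so it is escaped -->
     <xsl:for-each select="tokenize(., '\^')">
      <xsl:variable name="vPos" select="position()"/>
      <!-- Use the configured name, or fieldN if none is defined -->
      <xsl:element name=
       "{($vFieldNames[$vPos], concat('field', $vPos))[1]}">
       <xsl:value-of select="."/>
      </xsl:element>
     </xsl:for-each>
    </record>
   </xsl:for-each>
  </results>
 </xsl:template>
</xsl:stylesheet>
```

This needs an XSLT 2.0 processor that can start at a named template (Saxon, for example, offers an -it:main command-line option for this). A relative pFileUrl is resolved against the stylesheet's base URI.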

XSLT 2.0 has its own powerful instructions to handle text processing using regular expressions: the <xsl:analyze-string>, the <xsl:matching-substring> and the <xsl:non-matching-substring> instructions.
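A short sketch of these instructions, assuming the same kind of input as <t>John^Smith^Bellevue^WA^98004</t> and assuming that no field is empty: the regex matches the caret separator, so each non-matching substring is one field. The element name word is chosen only for illustration.

```xml
<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="/*">
  <results>
   <!-- The regex matches the separator; the text between
        two separators is a non-matching substring -->
   <xsl:analyze-string select="." regex="\^">
    <xsl:non-matching-substring>
     <word><xsl:value-of select="."/></word>
    </xsl:non-matching-substring>
   </xsl:analyze-string>
  </results>
 </xsl:template>
</xsl:stylesheet>
```

When empty fields must be preserved, tokenize() is probably the safer choice, because it returns an empty string for the gap between two adjacent separators.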

See some of the more powerful capabilities of XSLT text processing with these functions and instructions in this real-world example: an XSLT solution to the WideFinder problem.

Finally, here is an XSLT 1.0 solution:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:ext="http://exslt.org/common"
 xmlns:my="my:my" exclude-result-prefixes="ext my">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <my:fieldNames>
  <name>FirstName</name>
  <name>LastName</name>
  <name>City</name>
  <name>State</name>
  <name>Zip</name>
 </my:fieldNames>

 <xsl:variable name="vfieldNames" select=
  "document('')/*/my:fieldNames"/>

 <xsl:template match="/">
  <xsl:variable name="vrtfTokens">
   <xsl:apply-templates/>
  </xsl:variable>

  <xsl:variable name="vTokens" select=
       "ext:node-set($vrtfTokens)"/>

  <results>
   <xsl:apply-templates select="$vTokens/*"/>
  </results>
 </xsl:template>

 <xsl:template match="text()" name="tokenize">
  <xsl:param name="pText" select="."/>

     <xsl:if test="string-length($pText)">
       <xsl:variable name="vWord" select=
       "substring-before(concat($pText, '^'),'^')"/>

       <word>
        <xsl:value-of select="$vWord"/>
       </word>

       <xsl:call-template name="tokenize">
        <xsl:with-param name="pText" select=
         "substring-after($pText,'^')"/>
       </xsl:call-template>
     </xsl:if>
 </xsl:template>

 <xsl:template match="word">
  <xsl:variable name="vPos" select="position()"/>

  <field>
      <xsl:element name="{$vfieldNames/*[position()=$vPos]}">
      </xsl:element>
      <value><xsl:value-of select="."/></value>
  </field>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the following XML document:

<t>John^Smith^Bellevue^WA^98004</t>

the wanted, correct result is produced:

<results>
   <field>
      <FirstName/>
      <value>John</value>
   </field>
   <field>
      <LastName/>
      <value>Smith</value>
   </field>
   <field>
      <City/>
      <value>Bellevue</value>
   </field>
   <field>
      <State/>
      <value>WA</value>
   </field>
   <field>
      <Zip/>
      <value>98004</value>
   </field>
</results>
Mads Hansen
Dimitre Novatchev
  • +1 This _"I have a text file"_ requires XSLT 2.0. (Unless you have a DTD's-internal-subset-aware XML parser) –  Apr 15 '11 at 13:24
  • @Alejandro: An entity is part of the XML document -- the OP wants to be able to read any file given its URL -- probably the file URL would be passed as a parameter to the stylesheet. BTW, I appended my answer with a complete XSLT 1.0 solution :) – Dimitre Novatchev Apr 15 '11 at 13:31
  • @Dimitre: This XML wrapper ` ]>&text;` with the `test.txt` file containing `John^Smith^Bellevue^WA^98004` results in the same output. –  Apr 15 '11 at 14:11
  • @Alejandro: Yes. However this has nothing to do with XSLT -- only with XML. Also, let's not forget that due to security concerns many XML parsers disable entities by default. – Dimitre Novatchev Apr 15 '11 at 14:28
  • @Dimitre: Yes. And I think that is a bad thing: security concerns about accessing external resources should be handled by the system. There are so many uses for full DTD support... like getting the document URI with `<!ENTITY uri SYSTEM "#" NDATA uri>` and `unparsed-entity-uri('uri')` –  Apr 15 '11 at 14:39
  • +1 good answer @Dimitre. Michael Kay has a great article on text to xml conversion: Up-conversion using XSLT 2.0 http://www.saxonica.com/papers/ideadb-1.1/mhk-paper.xml – Steven D. Majewski Apr 15 '11 at 14:54
  • @Steven-D.-Majewski: Yes, I am aware of Michael Kay's article. I have personally achieved any text-processing tasks, including parsing of LR(1) languages with a generic parser, written entirely in XSLT 2.0. Interested? :) – Dimitre Novatchev Apr 15 '11 at 15:59
  • As usual, you provide the best possible answer, thanks a lot! – snoofkin Apr 17 '11 at 09:06
  • @Alejandro: You miss the point why entities represent a threat that cannot be prevented by the system: search for it. My hint is that it is extremely easy to launch a DOS attack using entities and this cannot be prevented in any other way than completely forbidding entities. Happy Binging :) – Dimitre Novatchev Apr 17 '11 at 14:22
1

Tokenizing and sorting with XSLT 1.0

If you use XSLT 2.0, it's much simpler: fn:tokenize(string, pattern)

Example: tokenize("XPath is fun", "\s+")
Result: ("XPath", "is", "fun")
VikciaR