Extracting textual content from XML documents using XSLT

Question

How is it possible to extract the textual content of an XML document, preferably using XSLT?

For a fragment such as

<record>
    <tag1>textual content</tag1>
    <tag2>textual content</tag2>
    <tag2>textual content</tag2>
</record>

the desired result is:

textual content, textual content, textual content

What is the best output format (table, CSV, etc.) so that the content can be processed further, for example for text mining?

Thanks

Update

To extend the question: how is it possible to extract the content of each record separately? For example, for the XML below:

<Records>
<record id="1">
    <tag1>textual co</tag1>
    <tag2>textual con</tag2>
    <tag2>textual cont</tag2>
</record>
<record id="2">
    <tag1>some text</tag1>
    <tag2>some tex</tag2>
    <tag2>some te</tag2>
</record>
</Records>

The desired result should be something like:

(textual co, textual con, textual cont) , (some text, some tex, some te)

or a format better suited to further processing.

possible duplicate of [XML to CSV Using XSLT](http://stackoverflow.com/questions/365312/xml-to-csv-using-xslt) — kjhughes, Jan 19 '15 at 20:29

matthias_h · Answer 1 · 2015-01-19T20:57:36.100

2

Just an (updated) answer for the first part of the question: for the input in the question, the following XSLT

<?xml version="1.0" encoding="UTF-8" ?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <!-- Plain-text output; doctype, XML declaration and indent settings have no effect here -->
  <xsl:output method="text" encoding="UTF-8" />
  <xsl:template match="record">
    <!-- Emit each child element's whitespace-normalized text -->
    <xsl:for-each select="child::*">
      <xsl:value-of select="normalize-space()"/>
      <!-- Separator after every child except the last -->
      <xsl:if test="position() != last()">, </xsl:if>
    </xsl:for-each>
  </xsl:template>
</xsl:transform>

has the result

textual content, textual content, textual content

The template matching record prints the value of each child element and appends ", " unless it is the last child element.

edited Jan 19 '15 at 20:57

answered Jan 19 '15 at 20:33

matthias_h

Seems a little fragile. If the non-significant whitespace were stripped out for some reason, this wouldn't work. Also, your `xsl:value-of` serves no purpose here. It doesn't output anything. – JLRishe Jan 19 '15 at 20:38
It doesn't produce any output. What's the problem? – Eilia Jan 19 '15 at 20:44
Just tested here http://xsltransform.net/bdxtq4 before posting and produces mentioned output, but just have a look at it. – matthias_h Jan 19 '15 at 20:50
Thanks @matthias_h, the problem is that the closing tag "" is missing in the posted solution. – Eilia Jan 19 '15 at 20:55
@EiliaAbraham Thanks for mentioning, you're right - just checked the first version and noticed it as I just added a second version. – matthias_h Jan 19 '15 at 21:01

Lingamurthy CS · Accepted Answer · 2015-01-20T06:11:05.623

1

You can use the following XSLT:

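<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" indent="yes"/>
  <!-- Drop whitespace-only text nodes so only real content is counted -->
  <xsl:strip-space elements="*"/>
  <!-- Select every text node in document order -->
  <xsl:template match="/">
    <xsl:apply-templates select="//text()"/>
  </xsl:template>
  <!-- Print each text node, followed by ", " unless it is the last one -->
  <xsl:template match="text()">
    <xsl:value-of select="."/>
    <xsl:if test="position() != last()">, </xsl:if>
  </xsl:template>
</xsl:transform>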

And for the update in the question, you can use the following XSLT:

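<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" indent="yes"/>
  <xsl:strip-space elements="*"/>
  <!-- Outermost element: process its children without printing parentheses for it -->
  <xsl:template match="/*">
    <xsl:apply-templates/>
  </xsl:template>
  <!-- Each record: wrap its text nodes in parentheses, comma-separate the groups -->
  <xsl:template match="*">
    <xsl:text>(</xsl:text>
    <xsl:apply-templates select=".//text()"/>
    <xsl:text>)</xsl:text>
    <xsl:if test="position() != last()">, </xsl:if>
  </xsl:template>
  <!-- Each text node: print it, followed by ", " unless it is the last in its group -->
  <xsl:template match="text()">
    <xsl:value-of select="."/>
    <xsl:if test="position() != last()">, </xsl:if>
  </xsl:template>
</xsl:transform>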
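
If a line-oriented format is easier to feed into later processing, a variation could emit one record per line. This is only a sketch under some assumptions: the records are elements named record as in the question, and no value contains a comma or a line break (proper CSV would need quoting for those cases).

```xml
<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <!-- One output line per record (assumes the element name from the question) -->
  <xsl:template match="record">
    <xsl:for-each select="*">
      <!-- Whitespace-normalized value of each child element -->
      <xsl:value-of select="normalize-space(.)"/>
      <xsl:if test="position() != last()">,</xsl:if>
    </xsl:for-each>
    <!-- Line feed terminates the record -->
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:transform>
```

Each line can then be split on the comma by whatever tool does the follow-up processing.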

edited Jan 20 '15 at 06:11

answered Jan 20 '15 at 03:53

Lingamurthy CS

Seeing as your code is longer you should say what its advantage is. – Ihe Onwuka Jan 20 '15 at 04:28
@IheOnwuka, it doesn't care about any of the elements in the document and operates just on text nodes. – Lingamurthy CS Jan 20 '15 at 06:05
@LingamurthyCS, Thanks for that, it's working. One more question: does XQuery provide some facility to transform data (e.g. text) into an array? – Eilia Jan 20 '15 at 06:17
@LingamurthyCS, I think it should be answered as a new question: http://stackoverflow.com/questions/28039499/transforming-xml-content-to-array – Eilia Jan 20 '15 at 07:09
@LingamurthyCS You do not need the first template and if you use the built in rules you don't need the if statement. Also you don't need the indent option when outputting text. See the update to my answer. – Ihe Onwuka Jan 20 '15 at 11:50
@IheOnwuka No, you are mistaken about this. Try it here: http://xsltransform.net/eiZQaFe. If you remove the first template, this changes the output. If you omit this template, the outermost element (`Records`) will also be matched by the second template. This answer is correct and isn't too long. – Mathias Müller Jan 21 '15 at 13:51
@Matthias. If you know the built in template rules you will know that the first template is not necessary. – Ihe Onwuka Jan 21 '15 at 14:43
@IheOnwuka I won't get notified if you spell my name with two "t"s. Other than that, I _know_ the built-in template rules very well. Why don't you test what I have said by removing the first template and see if the output changes instead of writing another nonsensical comment? – Mathias Müller Jan 21 '15 at 23:09
@MathiasMuller. The original question posed by the OP only requires a one template solution, because the built in template handles the last tag2 element. Test my revised solution on any XSLT processor you like. – Ihe Onwuka Jan 21 '15 at 23:26

Ihe Onwuka · Answer 3 · 2015-01-21T23:32:08.933

0

This is shorter and more generic in that it does not name any elements. It also exploits XSLT's built-in templates, which give the language default behaviour and reduce the amount you have to code. It assumes XSLT 1.0.

Below is a shorter variation of Lingamurthy CS's answer that lets the built-in template rule handle the last text node. It's analogous to my previous answer.

<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <!-- Every element that is not the last element child of its parent: print its value and a comma -->
  <xsl:template match="*[position() != last()]">
    <xsl:value-of select="."/><xsl:text>,</xsl:text>
  </xsl:template>
</xsl:transform>
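
For reference, the built-in template rules that a stylesheet like this falls back on are described in the XSLT 1.0 specification as equivalent to the rules below. They are written out here only for illustration; a processor applies them implicitly to any node that no explicit template matches.

```xml
<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Root node and elements: recurse into the children -->
  <xsl:template match="*|/">
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Text and attribute nodes: copy the string value to the output -->
  <xsl:template match="text()|@*">
    <xsl:value-of select="."/>
  </xsl:template>

  <!-- Processing instructions and comments: output nothing -->
  <xsl:template match="processing-instruction()|comment()"/>
</xsl:transform>
```

With the single-record input from the question, the record element and the last tag2 element match no explicit template, so the first rule descends into them and the second rule prints the final text node without a trailing comma.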

However, this particular job is better suited to XQuery.

Paste your XML into http://try.zorba.io/queries/xquery and just append /string-join(*,',') to it, like so:

<record>
    <tag1>textual content</tag1>
    <tag2>textual content</tag2>
    <tag2>textual content</tag2>
</record>/string-join(*,',')

Exercise for the OP to translate that into XSLT 2.0 if that is what they are using.
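
One possible translation, as a minimal sketch that assumes an XSLT 2.0 processor and the single-record input from the first part of the question: the separator attribute of xsl:value-of joins the selected items much as string-join does.

```xml
<xsl:transform version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Join the string values of all child elements of the record with ", " -->
  <xsl:template match="record">
    <xsl:value-of select="*" separator=", "/>
  </xsl:template>
</xsl:transform>
```

The separator used here is ", " to match the desired output in the question; use "," to mirror the XQuery expression exactly.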

edited Jan 21 '15 at 23:32

answered Jan 20 '15 at 01:20

Ihe Onwuka

@lhe, Thanks for the XQuery expression, it's fantastic! What about the case noted in the update section of the question above? Any idea for that? – Eilia Jan 20 '15 at 06:04
Are you sure about the updated answer? [This](http://xsltransform.net/6qVRKw8) doesn't seem to work. – Lingamurthy CS Jan 20 '15 at 12:03
Try it here http://markbucayan.appspot.com/xslt/index.html. – Ihe Onwuka Jan 20 '15 at 12:28
Use xsltransform.net to "prove" your solution, and forget markbucayan.appspot.com. Saxon is the most compliant processor around. Your site's processor strips whitespace-only text nodes from the input tree, but not all processors do that. Also, your answer lacks the parentheses from the expected output. Finally, this job is _not_ better suited for XQuery. – Mathias Müller Jan 21 '15 at 13:54
@Matthias - Don't get hung up on online sites. I used one out of expediency. Learn the built in template rules. – Ihe Onwuka Jan 21 '15 at 14:50
You are really taxing my patience, but I will make one more attempt to explain it to you. The site you were using is not really useful for the task because you have no control over which XSLT processor is used. On the other hand, the site I suggested lets you choose from different processors, versions of Saxon among them, which is the most reliable processor. – Mathias Müller Jan 21 '15 at 23:15
I don't care about the site or your patience. Test the solution on your desktop. – Ihe Onwuka Jan 21 '15 at 23:31
