5

I have an XML file and I need to convert it to CSV using XQuery. Consider a simple XML structure:

books[book]
book[@isbn, title, description]

e.g.:

<books>
    <book isbn="1590593049">
        <title>Extending Flash MX 2004</title>
        <description>
        Using javascript alongwith actionscript 3.0 and mxml.</description>
    </book>
    <book isbn="0132149184">
        <title>Java Software Solutions</title>
        <description>
            Complete book full of case studies on business solutions and design concepts while building mission critical
            business applications.
        </description>
    </book>

How can I convert it to CSV format using XQuery? The CSV will be opened in Microsoft Excel, so fields should be delimited by a comma (,) and special characters should be escaped.

Ranjan Sarma

2 Answers

4

Assuming your XML is in the variable $books, you could create a CSV file with each book on its own line using this:

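xquery version "3.0";  (: the || concatenation operator requires XQuery 3.0 :)

(: Wrap each value in double quotes and double any embedded double quotes. :)
declare function local:my-replace($input) {
  for $i in $input
  return '"' || replace($i, '"', '""') || '"'
};

(: One comma-separated line per book, terminated by a newline. :)
for $book in $books//book
return string-join(local:my-replace(($book/@isbn, $book/title, $book/description)), ",") || '&#xa;'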

string-join concatenates the field values with a comma between them, and the local function my-replace wraps each value in double quotes and doubles any embedded double quotes, according to your specification.
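To see the effect on a value that contains both a comma and double quotes, here is a minimal self-contained sketch. It assumes an XQuery 3.0 processor; the inline $books declaration is a hypothetical stand-in for however the document is actually loaded. It also joins the rows with a newline rather than appending one per row, since some serializers insert a space between adjacent strings in a sequence.

```xquery
xquery version "3.0";

(: Hypothetical inline sample; in practice $books would be bound externally or loaded with doc(). :)
declare variable $books :=
  <books>
    <book isbn="XX1234567">
      <title>Quotes and comma</title>
      <description>Hello, World from "Ms-Excel"</description>
    </book>
  </books>;

(: Wrap each value in double quotes and double any embedded double quotes. :)
declare function local:my-replace($input) {
  for $i in $input
  return '"' || replace($i, '"', '""') || '"'
};

(: Build one CSV row per book, then join the rows with a newline. :)
string-join(
  for $book in $books//book
  return string-join(local:my-replace(($book/@isbn, $book/title, $book/description)), ","),
  '&#xa;')
```

For this sample the query returns the single line "XX1234567","Quotes and comma","Hello, World from ""Ms-Excel""".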

dirkk
  • The snippet you posted works well in the normal case, but in CSV, fields are separated by a comma (,). If the text itself contains a comma, Excel will interpret it as a separator even though it is part of the text, for example in the description or title cell. A typical approach is to escape commas by wrapping the field in double quotes (") and to escape double quotes by doubling them: Hello, World from "Ms-Excel" becomes "Hello, World from ""Ms-Excel""". How can I escape values this way? Thanks for your great help. – Ranjan Sarma Dec 31 '12 at 07:16
  • I just edited my answer to reflect the specification you added. I just created a user-defined function my-replace, which adds a " at the beginning and the end of every element and replaces " with "". – dirkk Jan 01 '13 at 11:38
4

A pure XPath 2.0 expression:

for $b in /*/book
    return
      concat(escape-html-uri(string-join(($b/@isbn,
                                          $b/title,
                                          $b/description
                                          )
                                           /normalize-space(),
                                        ",")
                             ),
             codepoints-to-string(10))

XSLT 2.0-based verification:

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="/">
  <xsl:sequence select=
   "for $b in /*/book
       return
         concat(escape-html-uri(string-join(($b/@isbn,
                                             $b/title,
                                             $b/description
                                             )
                                              /normalize-space(),
                                           ',')
                                ),
                codepoints-to-string(10))"/>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the provided XML document (corrected to be well-formed):

<books>
    <book isbn="1590593049">
        <title>Extending Flash MX 2004</title>
        <description>
        Using javascript alongwith actionscript 3.0 and mxml.</description>
    </book>
    <book isbn="0132149184">
        <title>Java Software Solutions</title>
        <description>
            Complete book full of case studies on business solutions and design concepts while building mission critical
            business applications.
        </description>
    </book>
</books>

the wanted, correct result is produced:

1590593049,Extending Flash MX 2004,Using javascript alongwith actionscript 3.0 and mxml.
 0132149184,Java Software Solutions,Complete book full of case studies on business solutions and design concepts while building mission critical business applications.

Update:

In a comment, the OP requested that any in-text comma be surrounded by quotes, that (after that) any quote be replaced by two quotes, and, finally, that if the whole result contains a quote, it be surrounded by (single) quotes.

Here is a pure XPath 2.0 expression that produces this:

for $b in /*/book,
    $q in codepoints-to-string(34),
    $NL in codepoints-to-string(10),
    $isbn in normalize-space(replace($b/@isbn, ',', concat($q,',',$q))),
    $t in normalize-space(replace($b/title, ',', concat($q,',',$q))),
    $d in normalize-space(replace($b/description, ',', concat($q,',',$q))),
    $res in
     escape-html-uri(string-join(($isbn,$t,$d), ',')),
    $res2 in replace($res, $q, concat($q,$q))
   return
    if(contains($res2, $q))
       then concat($q, $res2, $q, $NL)
       else concat($res2, $NL)

When this XPath expression is evaluated against this (extended with a new test-case) XML document:

<books>
    <book isbn="1590593049">
        <title>Extending Flash MX 2004</title>
        <description>
        Using javascript alongwith actionscript 3.0 and mxml.</description>
    </book>
    <book isbn="0132149184">
        <title>Java Software Solutions</title>
        <description>
            Complete book full of case studies on business solutions and design concepts while building mission critical
            business applications.
        </description>
    </book>
    <book isbn="XX1234567">
        <title>Quotes and comma</title>
        <description>
            Hello, World from "Ms-Excel"
        </description>
    </book>
</books>

the wanted, correct result is produced:

1590593049,Extending Flash MX 2004,Using javascript alongwith actionscript 3.0 and mxml.
0132149184,Java Software Solutions,Complete book full of case studies on business solutions and design concepts while building mission critical business applications.
"XX1234567,Quotes and comma,Hello"","" World from ""Ms-Excel"""
Dimitre Novatchev
  • But if it were pure XPath 2, it would not interpret the '&#xA;' as a line break. – BeniBela Dec 30 '12 at 19:30
  • @BeniBela, Your question isn't clear -- I updated my answer with an XSLT transformation that uses the same XPath expression. In case this wasn't a legal XPath expression, there would be an error, but the transformation works without any problem. – Dimitre Novatchev Dec 30 '12 at 19:52
  • Well, if you include it in XSLT it is no longer *pure* XPath. Then the XML parser replaces the &#xA;. If you just run the *pure* XPath without XSLT you get: `1590593049,Extending Flash MX 2004,Using javascript alongwith actionscript 3.0 and mxml.&#xA; 0132149184,Java Software Solutions,Complete book full of case studies on business solutions and design concepts while building mission critical business applications.&#xA;` in the example. – BeniBela Dec 30 '12 at 21:26
  • @BeniBela, You might be right. I wonder why XQuery processors recognize the character reference... Of course, to correct the noticed issue, one would simply replace `'&#xA;'` with `codepoints-to-string(10)`. – Dimitre Novatchev Dec 30 '12 at 21:58
  • @DimitreNovatchev, the snippet you posted works well in the normal case, but in CSV, fields are separated by a comma (,). If the text itself contains a comma, Excel will interpret it as a separator even though it is part of the text, for example in the description or title cell. A typical approach is to escape commas by wrapping the field in double quotes (") and to escape double quotes by doubling them: Hello, World from "Ms-Excel" becomes "Hello, World from ""Ms-Excel""". How can I escape values this way? Thanks for your great help. – Ranjan Sarma Dec 31 '12 at 07:15
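
The per-field convention described in this comment can be sketched as a small helper that quotes a field only when it contains a comma or a double quote. This is an illustrative sketch, not taken from either answer: the function name local:csv-field is hypothetical, and the three sample strings stand in for one book's isbn, title, and description. It uses only XQuery 1.0 / XPath 2.0 functions.

```xquery
xquery version "1.0";

(: Hypothetical helper: quote a field only if it contains a comma or a double quote. :)
declare function local:csv-field($value as xs:string) as xs:string {
  let $s := normalize-space($value)
  return
    if (matches($s, '[,"]'))
    then concat('"', replace($s, '"', '""'), '"')  (: double embedded quotes, then wrap :)
    else $s
};

(: Sample values standing in for one book's fields. :)
string-join(
  for $v in ('XX1234567', 'Quotes and comma', 'Hello, World from "Ms-Excel"')
  return local:csv-field($v),
  ',')
```

For these values the result is XX1234567,Quotes and comma,"Hello, World from ""Ms-Excel""", so only the field that needs quoting is quoted. Fields containing line breaks would also need quoting in general; normalize-space collapses them here, so the check is omitted.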