4

How do I best make an MS-Excel compatible csv file representing all documents in a MarkLogic directory Using XCC Java Client and Tomcat and Marklogic Both are remotely located. Number of document in the directory is around 15000.

Sofia
  • 771
  • 1
  • 8
  • 22
Ranjan Sarma
  • 1,565
  • 5
  • 22
  • 36
  • This sounds like "can you please do my job for me" type of question. – Eric Bloch Feb 22 '13 at 16:44
  • @EricBloch I fully agree. Please follow my original question on this link: http://stackoverflow.com/questions/14998342/marklogic-query-response-time-is-very-high#comment21083544_14998342 .... thanks. – Ranjan Sarma Feb 22 '13 at 17:46

1 Answers1

3

The first part, getting all the documents in a directory, is ready for us from avoiding XDMP-EXPNTREECACHEFULL and loading document

cts:search(
  collection(),
  cts:directory-query('path/to/documents/', 'infinity'))

As noted in my answer there, if you need further restrictions you could cts:and-query that cts:directory-query with other cts:query terms.

Next you need to turn each XML document into CSV. That's fairly simple, but you have to know how your XML is structured or have some way to infer it. For this example I will say that I always have simple child element a, b, c, d under some root element. So the query needs to produce a CSV header for those elements, followed by lines of CSV.

We probably also want to hand in the directory URI from the caller. If you were using REST this would use xdmp:get-request-field but for XCC it is an external value.

declare variable $DIRECTORY-URI as xs:string external ;

declare function local:csv($root as element()) as xs:string
{
  string-join(($root/a, $root/b, $root/c, $root/d), ',')
};

'A,B,C,D',
cts:search(
  collection(),
  cts:directory-query($DIRECTORY-URI, 'infinity'))/local:csv(*)

Again, making local:csv work for your application requires some knowledge of the XML or some way to infer its structure. You might need to put some values in double-quotes, too. But this basic structure is one of the most efficient ways to attack the problem. I've avoided any XQuery FLWOR expressions, so that the results can stream.

Another approach would be to use range indexes and http://docs.marklogic.com/cts:value-tuples with a cts:query to restrict the results, then convert the JSON to CSV. This would be even more efficient because no fragments would be fetched. But this won't work well with some XML structures, and you may not have the luxury of create a range index for every CSV field.

declare variable $DIRECTORY-URI as xs:string external ;

declare function local:csv($ja as json:array) as xs:string
{
  string-join(json:array-values($ja), ',')
};

'A,B,C,D',
local:csv(
  cts:value-tuples(
    (cts:element-reference(xs:QName('a')),
     cts:element-reference(xs:QName('b')),
     cts:element-reference(xs:QName('c')),
     cts:element-reference(xs:QName('d'))),
    (),
    cts:directory-query($DIRECTORY-URI, 'infinity')))
Community
  • 1
  • 1
mblakele
  • 7,782
  • 27
  • 45