2

I have a merged XML file which (for the sake of simplicity) has the following form:

<bookstore>
    <books>
        <book id="1"/>
        <book id="2"/>
        <book id="2"/>
        <book id="3"/>
        <book id="10"/>
    </books>
</bookstore>

Because the XML is merged, there are books with the same ID attribute. I need to make the ID’s unique following this rule: If an ID is encountered (from top to down) that is already taken, change this ID to MAX(ID)+1.

<bookstore>
    <books>
        <book id="1"/>
        <book id="2"/>
        <book id="11"/>
        <book id="3"/>
        <book id="10"/>
    </books>
</bookstore>

A straightforward way to do this would be to extract the ID’s, check their occurrence, and if it occurs more than once, then search the second occurrence (from top to down) of the ID and replace it. But this isn’t very elegant…

As I am reading about XML processing now, I was hoping for a (simple) XQuery which could do this. If anyone has some pointers or pseudo code: they are all welcome.

My environment is Oracle (PL)SQL database, supporting XMLTYPE and XQuery.

helderdarocha
  • 23,209
  • 4
  • 50
  • 65
Davor Josipovic
  • 5,296
  • 1
  • 39
  • 57
  • You can transform the document using XSLT, which has a `generate-id()` function that allows you to generate unique IDs based on node expressions. You can achieve that with a very short XSLT document containing a simple template which copies each node and replaces the `id` attribute contents of each `book` with a sequence created by the `generate-id()` function. – helderdarocha May 30 '14 at 22:10

1 Answers1

2

In your environment you can use XSLT 1.0 to transform the document and generate IDs during the process. See: DBMS_XSLPROCESSOR.

With a XSLT stylesheet you can copy the nodes from your XML source to a result tree, creating unique IDs in the process. The IDs will not be sequential numbers, but unique string sequences generated by the generate-id() method. You can't control what they look like, but you can guarantee they are unique. (XSLT also allows you to get rid of duplicate nodes (using a key) if that's your intention, but from your example I understood that duplicate *ID*s doesn't actually mean the node is a duplicate, since you want to generate a new ID for it.)

The stylesheet below has two templates. The second one is an identity transform: it simply copies elements and attributes to the result tree. The first template creates an attribute named id containing an unique ID.

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:strip-space elements="*"/>
    <xsl:output indent="yes"/>

    <xsl:template match="book">
        <xsl:copy>
            <xsl:attribute name="id">
                <xsl:value-of select="generate-id(.)"/>
            </xsl:attribute>
            <xsl:apply-templates select="node()|@*[name() != 'id']"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>

The other templates (in this case only the identity template) are called for all nodes and attributes, except the id attribute by <xsl:apply-templates ...>. The result is a copy of your original XML file with generated unique IDs for the book elements.

If you had a XML such as this one:

<bookstore>
    <books>
        <book id="1" other="123"/>
        <book id="2"/>
        <book id="2"/>
        <book id="3">
            <chapter number="123" id="ch1">Text</chapter>
        </book>
        <book id="10"/>
    </books>
    <magazines>
        <mag id="non-book-id"></mag>
    </magazines>
</bookstore>

the XSLT above would transform it into this XML:

<bookstore>
   <books>
      <book id="d2e3" other="123"/>
      <book id="d2e4"/>
      <book id="d2e5"/>
      <book id="d2e6">
         <chapter number="123" id="ch1">Text</chapter>
      </book>
      <book id="d2e9"/>
   </books>
   <magazines>
      <mag id="non-book-id"/>
   </magazines>
</bookstore>

(the string sequences are arbitrary, and might be different in your implementation).

For creating ID/IDREF links the generated string sequences are better than numbers since you can use them anywhere (numbers and identifiers that start with numbers can't always be used as IDs). But if string sequences are not acceptable and you need sequential numbers, you can use XPath node position() in XQuery or XSLT to generate a number based on the element's position in the whole document (which will be unique). If all books are siblings in the same context, you can simply replace the generate-id(.) in the stylesheet above for position():

<xsl:template match="book">
    <xsl:copy>
        <xsl:attribute name="id">
            <xsl:value-of select="position()"/>
        </xsl:attribute>
        <xsl:apply-templates select="node()|@*[name() != 'id']"/>
    </xsl:copy>
</xsl:template>

(if the books are not siblings, you will need to do it in a slightly different way, using a variable).

If you want to retain the existing IDs and only generate sequential ones for the duplicates, it will be a bit more complicated but you can achieve that with keys (or XQuery instead of XSLT). The maximum id can be obtained in XPath 2.0 using the max() function:

max(//book/@id)

That function does not exist in XPath 1.0, but you can obtain the maximum ID by using:

//book[not(@id < //book/@id)]/@id
helderdarocha
  • 23,209
  • 4
  • 50
  • 65
  • Is there any reason why you chose XML transformation and not XQuery Update Facility? As far as I see, XSLT is used to transform a document (even structure) and XQuery Update Facility to modify its contents. “Transformation” can maybe be seen as containing/enclosing “modification”, but it seems a bit of an overkill for what is needed here? Or am I wrong? – Davor Josipovic Jun 03 '14 at 07:39
  • @davor You are probably correct, and this question certainly deserves an XQuery answer. Regarding Oracle PL/SQL I'm just a user. I used XSLT because I'm much more experienced with it and saw it as a *possible* solution for this problem. Probably not the best solution, though. It would be great if you could add an XQuery answer. – helderdarocha Jun 03 '14 at 12:55
  • Eventually I went with a procedural PL/SQL & DOM approach. XQuery Update Facility is only supported since Oracle 12cR1, so I didn’t really go into its details. Such apparently simple update is not even trivial with pure SQL as discussed [here](http://stackoverflow.com/questions/23994409/update-statement-re-evaluate-subquery-after-each-set). Note: unique integers are required, not just unique strings and those that have no duplicates may not be changed. Until anything better comes along, I’ll mark your answer as accepted. – Davor Josipovic Jun 03 '14 at 16:14