Complex XPath query

Question

I need to write a fairly complex XSLT 1.0 query.

Given the following XML file, I need a query that returns the set of authors who appear in more than one report (for example Antonio Rossi, because he is on both report 1 and report 2).

<reports>
  <report id="01">
    <titolo>
      I venti del Nord
    </titolo>
    <autori>
      <autore>
        Antonio Rossi
      </autore>
      <autore>
        Mario Verdi
      </autore>
    </autori>
    <versioni>
      <versione numero="1.0">
        <data>
          13-08-1980
        </data>
        <autore>
          Mario Verdi
        </autore>
        <commento>
          versione iniziale
        </commento>
      </versione>
      <versione numero="2.0">
        <data>
          14-08-1981
        </data>
        <autore>
          Antonio Rossi
        </autore>
        <commento>
          poche modifiche
        </commento>
      </versione>
    </versioni>
  </report>
  <report id="02">
    <titolo>
      Le pioggie del Nord
    </titolo>
    <autori>
      <autore>
        Antonio Rossi
      </autore>
      <autore>
        Luca Bianchi
      </autore>
    </autori>
    <versioni>
      <versione numero="1.0">
        <data>
          13-12-1991
        </data>
        <autore>
          Antonio Rossi
        </autore>
        <commento>
          versione iniziale
        </commento>
      </versione>
      <versione numero="2.0">
        <data>
          14-08-1992
        </data>
        <autore>
          Antonio Rossi
        </autore>
        <commento>
          modifiche al cap. 1
        </commento>
      </versione>
      <versione numero="3.0">
        <data>
          18-08-1992
        </data>
        <autore>
          Antonio Rossi
        </autore>
        <commento>
          Aggiunta intro.
        </commento>
      </versione>
      <versione numero="4.0">
        <data>
          13-01-1992
        </data>
        <autore>
          Luca Bianchi
        </autore>
        <commento>
          Modifiche sostanziali.
        </commento>
      </versione>
    </versioni>
  </report>
  <report id="03">
    <titolo>
      Precipitazioni nevose
    </titolo>
    <autori>
      <autore>
        Fabio Verdi
      </autore>
      <autore>
        Luca Bianchi
      </autore>
    </autori>
    <versioni>
      <versione numero="1.0">
        <data>
          11-01-1992
        </data>
        <autore>
          Fabio Verdi
        </autore>
        <commento>
          versione iniziale
        </commento>
      </versione>
      <versione numero="2.0">
        <data>
          13-01-1992
        </data>
        <autore>
          Luca Bianchi
        </autore>
        <commento>
          Aggiornato indice
        </commento>
      </versione>
    </versioni>
  </report>
</reports>

score 5 · Answer 1 · answered Jul 13 '12 at 19:33

If you can use XPath 2.0 you could use:

distinct-values(/reports/report/autori/autore[preceding::report/autori/autore = . or following::report/autori/autore = .])

With your input XML it will return:

Antonio Rossi
Luca Bianchi

Daniel Haley

Really nice answer. This answer deserves the tick. Although it would have been better to use normalize-space() in your comparison. – Sean B. Durkin Jul 14 '12 at 10:36

score 3 · Petr Janeček · Accepted Answer · 2012-07-17T09:12:25.763

This works even in XPath 1.0:

//report//autore[text()=../../following-sibling::report//autore/text()]

It selects all autore nodes that have text content equal to any autore node in any of the following report nodes, too.

Or, to keep it short, even this should work if there's nothing really tricky in your real xml file:

//autore[text()=../../following-sibling::*//autore/text()]

EDIT: Working by accident. Please see the comments below.

answered Jul 13 '12 at 19:45

Both of these solutions are wrong for a number of reasons. First, a straight comparison of the text nodes is wrong because the text child of autore clearly includes a certain amount of non-significant whitespace that is there only for visual presentation. You need to protect against this with normalize-space(). It's also wrong because it doesn't produce a distinct list. If an author is present in 3 reports, he gets listed twice, contrary to the OP's stated requirement for a "set of authors". More in the next comment ... – Sean B. Durkin Jul 14 '12 at 10:29

Given the wording of the question, it would have been more accurate to return a sequence of text nodes instead of elements. And finally, it is horribly inefficient. The cost of the first expression will grow faster than the cube of the document size. – Sean B. Durkin Jul 14 '12 at 10:34
@SeanB.Durkin You are actually very right. The `/text()` and `normalize-space()` problems are easily fixable (now I can see just how lucky I got when it worked on the sample input), but the problem of a name popping up multiple times is a real issue which I think is unsolvable (or is it?) using only XPath 1.0. – Petr Janeček Jul 14 '12 at 12:45
I think it is actually unsolvable with XPath 1.0. Indeed, the solutions from the other users all use either XSLT or XPath 2.0. – Gotenks Jul 14 '12 at 18:25
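
The two objections raised in these comments (whitespace sensitivity and duplicate hits) can be reproduced outside an XSLT processor. The following is only a rough Python sketch that imitates the "equal to an autore in a following report" test. The sample document, with author "A" on three reports and differently indented in the second one, and the helper name authors_with_later_match are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: "A" is on three reports, "B" on one.
# Report 02 deliberately formats the name differently from 01 and 03.
XML = """<reports>
  <report id="01"><autori><autore>
    A
  </autore></autori></report>
  <report id="02"><autori><autore> A </autore><autore>B</autore></autori></report>
  <report id="03"><autori><autore>
    A
  </autore></autori></report>
</reports>"""

def authors_with_later_match(root, normalize):
    """Imitate: autore whose text equals some autore in a following report."""
    # With normalize=True, collapse whitespace the way normalize-space() does.
    clean = (lambda s: " ".join(s.split())) if normalize else (lambda s: s)
    reports = root.findall("report")
    hits = []
    for i, report in enumerate(reports):
        # Texts of every autore found in the reports after this one.
        later = {clean(a.text or "") for r in reports[i + 1:] for a in r.iter("autore")}
        for a in report.findall("autori/autore"):
            if clean(a.text or "") in later:
                hits.append(clean(a.text or ""))
    return hits

root = ET.fromstring(XML)
raw = authors_with_later_match(root, normalize=False)
normalized = authors_with_later_match(root, normalize=True)
print([s.strip() for s in raw])  # ['A']       (report 02 only matches after normalizing)
print(normalized)                # ['A', 'A']  (one author, reported twice)
```

Without normalization, a match is found only where the indentation happens to be identical. With normalization, every occurrence except the last is returned, so the result is not a set.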

score 3 · Answer 3 · edited May 23 '17 at 12:25

I. This simple (no for-each, no variables) XSLT 1.0 transformation:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="yes" indent="yes"/>

  <xsl:key name="kAuthorByVal" match="autori/autore" use="normalize-space()"/>

  <xsl:template match="/">
   <xsl:copy-of select=
    "//autori/autore
                  [generate-id()
                  =
                   generate-id(key('kAuthorByVal', normalize-space())[1])
                   ]
                  [key('kAuthorByVal', normalize-space())[2]]"/>
  </xsl:template>
</xsl:stylesheet>

When applied on the provided XML document:

<reports>
      <report id="01">
        <titolo>
          I venti del Nord
        </titolo>
        <autori>
          <autore>
            Antonio Rossi
          </autore>
          <autore>
            Mario Verdi
          </autore>
        </autori>
        <versioni>
          <versione numero="1.0">
            <data>
              13-08-1980
            </data>
            <autore>
              Mario Verdi
            </autore>
            <commento>
              versione iniziale
            </commento>
          </versione>
          <versione numero="2.0">
            <data>
              14-08-1981
            </data>
            <autore>
              Antonio Rossi
            </autore>
            <commento>
              poche modifiche
            </commento>
          </versione>
        </versioni>
      </report>
      <report id="02">
        <titolo>
          Le pioggie del Nord
        </titolo>
        <autori>
          <autore>
            Antonio Rossi
          </autore>
          <autore>
            Luca Bianchi
          </autore>
        </autori>
        <versioni>
          <versione numero="1.0">
            <data>
              13-12-1991
            </data>
            <autore>
              Antonio Rossi
            </autore>
            <commento>
              versione iniziale
            </commento>
          </versione>
          <versione numero="2.0">
            <data>
              14-08-1992
            </data>
            <autore>
              Antonio Rossi
            </autore>
            <commento>
              modifiche al cap. 1
            </commento>
          </versione>
          <versione numero="3.0">
            <data>
              18-08-1992
            </data>
            <autore>
              Antonio Rossi
            </autore>
            <commento>
              Aggiunta intro.
            </commento>
          </versione>
          <versione numero="4.0">
            <data>
              13-01-1992
            </data>
            <autore>
              Luca Bianchi
            </autore>
            <commento>
              Modifiche sostanziali.
            </commento>
          </versione>
        </versioni>
      </report>
      <report id="03">
        <titolo>
          Precipitazioni nevose
        </titolo>
        <autori>
          <autore>
            Fabio Verdi
          </autore>
          <autore>
            Luca Bianchi
          </autore>
        </autori>
        <versioni>
          <versione numero="1.0">
            <data>
              11-01-1992
            </data>
            <autore>
              Fabio Verdi
            </autore>
            <commento>
              versione iniziale
            </commento>
          </versione>
          <versione numero="2.0">
            <data>
              13-01-1992
            </data>
            <autore>
              Luca Bianchi
            </autore>
            <commento>
              Aggiornato indice
            </commento>
          </versione>
        </versioni>
      </report>
</reports>

produces the wanted, correct result:

<autore>
            Antonio Rossi
          </autore>
<autore>
            Luca Bianchi
          </autore>

Explanation:

A key observation is that an autori/autore with a specific string value cannot be present more than once within a report. This significantly simplifies the solution (for a more complex solution, look in the early versions of this answer). All solutions presented in this answer rely on this observation.
We define a key that identifies an autori/autore by its normalized string value. Thus two autori/autore elements with different whitespace but representing the same author are treated as instances of the same author.
Using the Muenchian grouping method, we select the set of all autori/autore elements, each of which has a distinct normalized string value.
For each such selected autori/autore with a unique normalized string value, we also test that there is a second autori/autore with the same normalized string value. We select all such autori/autore elements, and this node-set is exactly what the problem requires.
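
For readers who want to check the grouping logic without running XSLT, here is a minimal Python sketch of the same idea. It is not the stylesheet itself: a dictionary stands in for xsl:key, and the trimmed sample document and the function names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, trimmed version of the sample data (only the autori part),
# with deliberately inconsistent whitespace around one name.
XML = """<reports>
  <report id="01"><autori><autore> Antonio Rossi </autore><autore>Mario Verdi</autore></autori></report>
  <report id="02"><autori><autore>Antonio  Rossi</autore><autore>Luca Bianchi</autore></autori></report>
  <report id="03"><autori><autore>Fabio Verdi</autore><autore>Luca Bianchi</autore></autori></report>
</reports>"""

def normalize_space(text):
    """Collapse whitespace runs and trim, similar to XPath normalize-space()."""
    return " ".join((text or "").split())

def authors_on_multiple_reports(root):
    # Stand-in for xsl:key: normalized name -> autori/autore elements in document order.
    key = defaultdict(list)
    for autore in root.findall("report/autori/autore"):
        key[normalize_space(autore.text)].append(autore)
    # One entry per distinct name (the "[1]" test), kept only if the group
    # has a second member (the "[2]" test). Dicts preserve insertion order.
    return [name for name, nodes in key.items() if len(nodes) >= 2]

print(authors_on_multiple_reports(ET.fromstring(XML)))
# prints ['Antonio Rossi', 'Luca Bianchi']
```

The sketch returns names instead of elements, but the selection rule is the same: one representative per distinct normalized name, kept only when the group has at least two members.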

II. XSLT 2.0 solution:

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text"/>

 <xsl:variable name="vSeq" select="//autori/autore/normalize-space()"/>
 <xsl:template match="/">
     <xsl:value-of select="$vSeq[index-of($vSeq,.)[2]]" separator="&#xA;"/>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied on the same XML document (above), the wanted, correct result is produced:

Antonio Rossi
Luca Bianchi

Explanation:

Here we use this answer and define $vSeq accordingly.

III. A single XPath 3.0 (and XQuery 3.0) expression - solution:

let $vSeq := //autori/autore/normalize-space()
 return
    $vSeq[index-of($vSeq,.)[2]]
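
Solutions II and III share the predicate $vSeq[index-of($vSeq,.)[2]]. It is terse, so here is a hedged Python sketch of how I read it. index-of() returns the 1-based positions of every occurrence of the current item, [2] picks the second of those positions, and a numeric predicate keeps an item only when that number equals the item's own position. The net effect is that exactly the second occurrence of each repeated value survives. The function names and the sample list below are illustrative, not part of any library.

```python
def index_of(seq, item):
    """1-based positions of item in seq, modelled on XPath 2.0 index-of()."""
    return [i for i, value in enumerate(seq, start=1) if value == item]

def second_occurrences(seq):
    """Keep each item whose position is the second position returned by index_of."""
    selected = []
    for position, item in enumerate(seq, start=1):
        positions = index_of(seq, item)
        # An empty result (no second occurrence) behaves like a false predicate.
        if len(positions) >= 2 and positions[1] == position:
            selected.append(item)
    return selected

# Normalized author names in document order, as $vSeq would hold them
# for the sample data (assumed here, not computed from the XML).
v_seq = ["Antonio Rossi", "Mario Verdi", "Antonio Rossi",
         "Luca Bianchi", "Fabio Verdi", "Luca Bianchi"]
print(second_occurrences(v_seq))
# prints ['Antonio Rossi', 'Luca Bianchi']
```

Because only the second occurrence is kept, an author who appears on three or more reports is still listed once.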

@SeanB.Durkin: The inclusion is the opposite -- XQuery 3.0 is a superset of (fully includes) XPath 3.0. — Dimitre Novatchev, Jul 14 '12 at 16:56

score 2 · Answer 4 · answered Jul 14 '12 at 14:17

Congrats to DevNull who got the first correct answer as it was posted at the time. At the time of his post, it was not known that the OP wanted an XSLT 1.0 solution. I provide one below.

Getting distinct values in XSLT 1.0, in any efficient way, requires Muenchian grouping. Here is how you could do it in XSLT 1.0 ...

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" />

<xsl:key name="kAuthors" match="autori/autore" use="normalize-space()" />

<xsl:template match="/">
The set of authors on multiple reports
====================================== 
<xsl:for-each select="reports/report/autori/autore[
   generate-id()=
   generate-id( key('kAuthors',normalize-space())[1])]">
  <xsl:variable name="author" select="normalize-space()" />   
  <xsl:for-each select="key('kAuthors',$author)[2]">
   <xsl:value-of select="concat($author,'&#x0A;')" /> 
  </xsl:for-each>
 </xsl:for-each>  
</xsl:template>

</xsl:stylesheet>

The above stylesheet, when applied to the OP's sample data, produces this text document ...

The set of authors on multiple reports
====================================== 
Antonio Rossi
Luca Bianchi

Explanation

In each report the authors appear twice. Once under autori and again under versione. We don't need to double count on each report, so we make the match pattern for the key autori/autore. The key value is the author's name as a string. Thus the key groups authors.

We use standard Muenchian grouping to iterate through the distinct authors. This is the outer for-each. Now we are just interested in the "repeat offenders". We can get these by applying a [2] predicate in the inner loop. Authors who appear in at most one report are filtered out, because their group contains only one node and so has no second member.
