I'm writing a generic HTML explorer that can carry out a list of operations, such as visit page, find table, find rows, store data, etc. It uses Goutte/Guzzle internally, and thus can use CSS and XPath selectors. I have an interesting problem I'm stuck on regarding selecting a new set of results relative to an existing set of results.

Consider this demo HTML:

    <h2>Burrowing</h2>
    <ul>
        <li>
            <a href="/jobs/junior-mole">Junior Mole</a>
        </li>
        <li>
            <a href="/jobs/head-of-badger-partnerships">Head of Badger Partnerships</a>
        </li>
        <li>
            <a href="/jobs/trainee-worm">Trainee Worm</a>
        </li>
    </ul>

    <h2>Tree Surgery</h2>
    <ul>
        <li>
            <a href="/jobs/senior-woodpecker">Senior Woodpecker</a>
        </li>
        <li>
            <a href="/jobs/owl-supervisor">Owl Supervisor</a>
        </li>
    </ul>

    <h2>Grass maintenance</h2>
    <ul>
        <li>
            <a href="/jobs/trainee-sheep">Trainee sheep</a>
        </li>
        <li>
            <a href="/jobs/sheep-shearer">Sheep shearer</a>
        </li>
    </ul>

    <h2>Aerial supervision</h2>
    <ul>
        <li>
            <a href="/jobs/head-magpie-ops">Head of Magpie Operations</a>
        </li>
    </ul>

I run this CSS query to get the roles in the links (this correctly gets eight items):

    ul li a

For each one, I'd like to get the category, which is the <h2> immediately preceding the <ul> in each case. Now I could do it with an absolute CSS selector thus:

    h2

However that gets four results, so I don't know which category (h2) goes with which job (the link). I need to get eight results: three lots of the first category, two of the second, two of the third, and one of the fourth, so each category maps onto each role.

I wondered if I would need a parent selector for this, so I switched from CSS to XPath, and first tried this, which gets each h2 having an immediately following list item:

    //h2[(following-sibling::ul)[1]/li/a]

That finds the h2 elements that are followed by the specified list structure, but again comes back with four results - no good.

Next attempt:

    //ul/li[../preceding-sibling::h2[1]]

That gets the right number of results (based on getting a list item with an immediately preceding title) but gets the link text, not the category text.

I thought about doing a loop - I know I have eight results, so I could do this (X is an injected variable looping from 1 to 8). This works, but I regard the addition of a manual loop here rather inelegant - I'm trying to keep my rules as generic as possible:

    (//li)[X]/../preceding-sibling::h2[1]

Is there an XPath operation that can return the required results? For the avoidance of doubt I am looking for the following (or just the text elements would be fine):

    <h2>Burrowing</h2>
    <h2>Burrowing</h2>
    <h2>Burrowing</h2>
    <h2>Tree Surgery</h2>
    <h2>Tree Surgery</h2>
    <h2>Grass maintenance</h2>
    <h2>Grass maintenance</h2>
    <h2>Aerial supervision</h2>

CSS would be fine too, but I assume that it's not possible because CSS doesn't have a parent operator (in any case, Goutte just converts CSS selectors into XPath selectors).

Since I am on PHP (5.5), I believe I have to stick to XPath 1.0.

halfer

2 Answers


So I'm not sure how you are trying to use this but I'd try something like:

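    use Symfony\Component\DomCrawler\Crawler;

    $links = $crawler->filter('ul li a');
    foreach ($links as $node) {
       // iterating a Crawler yields DOM nodes, so wrap each one in a Crawler again
       $link = new Crawler($node);
       // do stuff with the link
       // ...
       // get the H2 (the expression is XPath, so filterXPath() rather than filter())
       $header = $link->parents()->filterXPath('ul/preceding-sibling::h2[1]');
       // do stuff with the header
    }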

Note this is untested and I came up with it from looking at the Symfony\Component\DomCrawler API directly, but I think it should work based on that (unless I have the XPath wrong - but if I do that should be pretty easy for you to work out).

You could of course also use Symfony\Component\DomCrawler\Crawler::each and do this inside of a closure instead of doing the foreach...

prodigitalson
  • Thanks for the suggestion! However, I'm trying to generalise my processing steps as much as possible - the "grab rows" of `ul li a` is fine, and the second expression you have is effectively a "grab row data" operation. However the `parents()` thing makes it less generic and ideally I'd like to get it to work without that (i.e. so when parsing a new page I just add various pre-defined step types, and don't have to write any PHP at all). I suppose `parents()` could be a step in itself though, so the process would be "grab these rows [xpath], traverse to parents, grab these columns [xpath]". – halfer Jan 30 '15 at 21:32
  • Interestingly, I just this minute found out that XPath 2.0 [has a `for` operation](http://www.xml.com/pub/a/2002/03/20/xpath2.html?page=2), so I guess this would be trivial in that version! However, I am stuck on 1.0, unless I can find the time to get a 2.0 parser working on the console and hack it into Goutte (not really worth the bother, IMO). – halfer Jan 30 '15 at 21:33

No, there is no single XPath 1.0 expression that returns what you want. Firstly because XPath 1.0 does not allow iterating over intermediate results and secondly because a sequence of items is defined as a node-set - in which there can be no duplicates.
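To illustrate the node-set point, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath rather than Goutte. The shortened two-category markup and the variable names are assumptions made for this example only. A single expression that walks from every link to its heading yields each heading once, however many links point at it:

```php
<?php
// Hypothetical fixture: a shortened version of the question's markup.
$html = <<<HTML
<div>
  <h2>Burrowing</h2>
  <ul>
    <li><a href="/jobs/junior-mole">Junior Mole</a></li>
    <li><a href="/jobs/trainee-worm">Trainee Worm</a></li>
  </ul>
  <h2>Tree Surgery</h2>
  <ul>
    <li><a href="/jobs/owl-supervisor">Owl Supervisor</a></li>
  </ul>
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Three links in the fixture...
$links = $xpath->query('//a');

// ...but the headings reached from them collapse into a node-set,
// so each h2 is returned only once.
$headings = $xpath->query('//a/preceding::h2[1]');

$linkCount = $links->length;       // 3
$headingCount = $headings->length; // 2

echo $linkCount, ' links, ', $headingCount, ' headings', PHP_EOL;
```

The same effect explains the four results reported in the question: the full markup there has eight links but only four distinct headings.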

I can see two possible solutions to your problem. Either write PHP code that

  • first retrieves all relevant a nodes, e.g. with an expression like //a
  • applies a second XPath expression to each of them in turn: preceding::h2[1]

You'd have to write that PHP code yourself, given my poor skills in it. But I can contribute an alternative: You could also use an XSLT 1.0 transformation, there are XSLT 1.0 processors in PHP.

Stylesheet

    <?xml version="1.0" encoding="UTF-8" ?>
    <xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
        <xsl:output method="xml" omit-xml-declaration="yes" indent="yes" />

        <xsl:template match="/">
          <xsl:for-each select="//a">
              <xsl:copy-of select="preceding::h2[1]"/>
          </xsl:for-each>
        </xsl:template>

    </xsl:transform>

Applied to your input (after adding a root element), the result is

    <h2>Burrowing</h2>
    <h2>Burrowing</h2>
    <h2>Burrowing</h2>
    <h2>Tree Surgery</h2>
    <h2>Tree Surgery</h2>
    <h2>Grass maintenance</h2>
    <h2>Grass maintenance</h2>
    <h2>Aerial supervision</h2>
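For the PHP side, a rough sketch of running such a stylesheet with the XSLTProcessor class might look like the following. It assumes the xsl extension is installed; the `<jobs>` wrapper element, the shortened input and the variable names are made up for this example.

```php
<?php
// Hypothetical input: shortened markup wrapped in an assumed <jobs> root element.
$xml = <<<XML
<jobs>
  <h2>Burrowing</h2>
  <ul>
    <li><a href="/jobs/junior-mole">Junior Mole</a></li>
    <li><a href="/jobs/trainee-worm">Trainee Worm</a></li>
  </ul>
  <h2>Tree Surgery</h2>
  <ul>
    <li><a href="/jobs/owl-supervisor">Owl Supervisor</a></li>
  </ul>
</jobs>
XML;

// The stylesheet from above, without the XML declaration.
$xsl = <<<XSL
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:output method="xml" omit-xml-declaration="yes" indent="yes" />
    <xsl:template match="/">
      <xsl:for-each select="//a">
          <xsl:copy-of select="preceding::h2[1]"/>
      </xsl:for-each>
    </xsl:template>
</xsl:transform>
XSL;

$input = new DOMDocument();
$input->loadXML($xml);

$stylesheet = new DOMDocument();
$stylesheet->loadXML($xsl);

$processor = new XSLTProcessor();
$processor->importStylesheet($stylesheet);

// transformToXml() returns the serialized result as a string.
$result = $processor->transformToXml($input);
echo $result;
```

Because the stylesheet copies one heading per link, the result string contains the "Burrowing" heading twice and the "Tree Surgery" heading once for this input.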

Try it online here. By the way, if you're interested in how to do it with XPath 2.0 using for, as you mentioned in a comment, see this version instead:

    for $a in //a return $a/preceding::h2[1]
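In XPath 1.0 that loop has to live in the host language instead. A minimal sketch of the two-step approach described earlier, using PHP's built-in DOMXPath rather than Goutte, could look like this. The shortened markup, the `$categories` array and its keys are assumptions for illustration.

```php
<?php
// Hypothetical fixture: a shortened version of the question's markup.
$html = <<<HTML
<div>
  <h2>Burrowing</h2>
  <ul>
    <li><a href="/jobs/junior-mole">Junior Mole</a></li>
    <li><a href="/jobs/trainee-worm">Trainee Worm</a></li>
  </ul>
  <h2>Tree Surgery</h2>
  <ul>
    <li><a href="/jobs/owl-supervisor">Owl Supervisor</a></li>
  </ul>
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$categories = [];

// Step 1: select every link.
foreach ($xpath->query('//a') as $link) {
    // Step 2: evaluate the relative expression with this link as the context node.
    $heading = $xpath->query('preceding::h2[1]', $link)->item(0);

    $categories[] = [
        'role'     => trim($link->textContent),
        'category' => $heading !== null ? trim($heading->textContent) : null,
    ];
}

foreach ($categories as $row) {
    echo $row['category'], ': ', $row['role'], PHP_EOL;
}
```

Each link gets its own query, so the duplicates survive: a heading appears once per link beneath it rather than once per document.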
Mathias Müller
  • Ah, two good new ideas, much thanks. The `for` XPath is most frustrating, since it is perfect, requires no design changes in my app, but the syntax is not available! Bah. The XSLT is worth some consideration: as per my comment to prodigitalson, I'm making a general parser so I can scan any structure without writing any new PHP, and a general transformer step would be a useful addition. – halfer Jan 30 '15 at 22:22
  • (I might have a fish around to see if anyone has got XPath 2.0 to work with PHP in some fashion, maybe there's an acceptable hack. I'll note it on this page if I find something. It does rather look [like it is some way away](http://stackoverflow.com/questions/2085632/will-xpath-2-0-and-or-xslt-2-0-be-implemented-in-php)). – halfer Jan 30 '15 at 22:23