I'm writing a generic HTML explorer that can carry out a list of operations, such as visit page, find table, find rows, store data, etc. It uses Goutte/Guzzle internally, and thus can use CSS and XPath selectors. I have an interesting problem I'm stuck on regarding selecting a new set of results relative to an existing set of results.
Consider this demo HTML:
<h2>Burrowing</h2>
<ul>
<li>
<a href="/jobs/junior-mole">Junior Mole</a>
</li>
<li>
<a href="/jobs/head-of-badger-partnerships">Head of Badger Partnerships</a>
</li>
<li>
<a href="/jobs/trainee-worm">Trainee Worm</a>
</li>
</ul>
<h2>Tree Surgery</h2>
<ul>
<li>
<a href="/jobs/senior-woodpecker">Senior Woodpecker</a>
</li>
<li>
<a href="/jobs/owl-supervisor">Owl Supervisor</a>
</li>
</ul>
<h2>Grass maintenance</h2>
<ul>
<li>
<a href="/jobs/trainee-sheep">Trainee sheep</a>
</li>
<li>
<a href="/jobs/sheep-shearer">Sheep shearer</a>
</li>
</ul>
<h2>Aerial supervision</h2>
<ul>
<li>
<a href="/jobs/head-magpie-ops">Head of Magpie Operations</a>
</li>
</ul>
I run this CSS query to get the roles in the links (this correctly gets eight items):
ul li a
For each one, I'd like to get the category, which is the <h2> immediately preceding the <ul> in each case. Now I could do it with an absolute CSS selector thus:
h2
However, that gets four results, so I don't know which category (h2) goes with which job (the link). I need to get eight results: three lots of the first category, two of the second, two of the third, and one of the fourth, so that each role is paired with its category.
I wondered if I would need a parent selector for this, so I switched from CSS to XPath, and first tried this, which gets each h2 whose first following list contains a linked list item:
//h2[(following-sibling::ul)[1]/li/a]
That finds the h2s that are followed by the specified structure, but again comes back with four results, which is no good.
Next attempt:
//ul/li[../preceding-sibling::h2[1]]
That gets the right number of results (based on getting a list item with an immediately preceding title) but returns the list items, and so the link text, not the category text.
I thought about doing a loop: I know I have eight results, so I could do this, where X is an injected variable looping from 1 to 8 (the parentheses make X index the whole set of list items rather than each item's position within its own list). This works, but I regard the addition of a manual loop here as rather inelegant, since I'm trying to keep my rules as generic as possible:
(//li)[X]/../preceding-sibling::h2[1]
Is there an XPath operation that can return the required results? For the avoidance of doubt I am looking for the following (or just the text elements would be fine):
<h2>Burrowing</h2>
<h2>Burrowing</h2>
<h2>Burrowing</h2>
<h2>Tree Surgery</h2>
<h2>Tree Surgery</h2>
<h2>Grass maintenance</h2>
<h2>Grass maintenance</h2>
<h2>Aerial supervision</h2>
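To pin down the pairing I'm after, here is a small sketch in Python (used only for illustration; it is not my PHP/Goutte code). It does not use the XPath expression I'm looking for: the standard-library ElementTree has no preceding-sibling axis, so the sketch simply walks the elements in document order and remembers the last h2 it saw. The function name `categories_per_link` and the wrapping `<div>` are my own assumptions for the example.

```python
import xml.etree.ElementTree as ET

# The demo markup from above, wrapped in a <div> (an assumption of this
# sketch) so that the fragment has a single root element.
HTML = """<div>
<h2>Burrowing</h2>
<ul>
<li><a href="/jobs/junior-mole">Junior Mole</a></li>
<li><a href="/jobs/head-of-badger-partnerships">Head of Badger Partnerships</a></li>
<li><a href="/jobs/trainee-worm">Trainee Worm</a></li>
</ul>
<h2>Tree Surgery</h2>
<ul>
<li><a href="/jobs/senior-woodpecker">Senior Woodpecker</a></li>
<li><a href="/jobs/owl-supervisor">Owl Supervisor</a></li>
</ul>
<h2>Grass maintenance</h2>
<ul>
<li><a href="/jobs/trainee-sheep">Trainee sheep</a></li>
<li><a href="/jobs/sheep-shearer">Sheep shearer</a></li>
</ul>
<h2>Aerial supervision</h2>
<ul>
<li><a href="/jobs/head-magpie-ops">Head of Magpie Operations</a></li>
</ul>
</div>"""


def categories_per_link(markup):
    """Return one (category, role) pair per link, in document order."""
    root = ET.fromstring(markup)
    pairs = []
    category = None
    # Only the direct children are visited: each <h2> is followed by its <ul>.
    for node in root:
        if node.tag == "h2":
            category = node.text
        elif node.tag == "ul":
            for link in node.findall("li/a"):
                pairs.append((category, link.text))
    return pairs


pairs = categories_per_link(HTML)
for category, role in pairs:
    print(category, "|", role)
```

The category column of this output is exactly the eight-item list above; what I'd like is a single XPath 1.0 expression (or relative query) that produces it without the host-language loop.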
CSS would be fine too, but I assume that it's not possible because CSS doesn't have a parent operator (in any case, Goutte just converts CSS selectors into XPath selectors).
Since I am on PHP (5.5), I believe I have to stick to XPath 1.0.