I am trying to parse screen-scraped data using Zend_Dom_Query, but I am struggling to apply it properly to my case, and all the other answers I have seen on SO make assumptions that, quite frankly, scare me with their naiveté.
A typical example is "How to Pass Array from Zend Dom Query Results to table", where pairs of data points are extracted from the document's body through separate calls to the query() method:
$year = $dom->query('.secondaryInfo');
$rating = $dom->query('.ratingColumn');
The underlying assumptions are that an equal number of $year and $rating results exist AND that they are correctly aligned with each other within the document. If either of those assumptions is wrong, then the extracted data is less than worthless - in fact it becomes all lies.
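To make the alignment risk concrete, here is a minimal sketch using PHP's stock DOMDocument and DOMXPath rather than Zend_Dom_Query. The table markup and its values are invented for illustration (they are not taken from the linked question); the second row deliberately has no rating cell.

```php
<?php
// Hypothetical markup: the middle row is missing its .ratingColumn cell.
$html = '<table>'
      . '<tr><td class="secondaryInfo">(1994)</td><td class="ratingColumn">9.2</td></tr>'
      . '<tr><td class="secondaryInfo">(1972)</td></tr>'
      . '<tr><td class="secondaryInfo">(2008)</td><td class="ratingColumn">9.0</td></tr>'
      . '</table>';

libxml_use_internal_errors(true); // keep parser notices out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Two independent, document-wide queries, as in the linked example.
$years   = $xpath->query('//*[@class="secondaryInfo"]');
$ratings = $xpath->query('//*[@class="ratingColumn"]');

// Pairing by position silently attaches a rating to the wrong year.
$pairs = array();
for ($i = 0; $i < $ratings->length; $i++) {
    $pairs[] = $years->item($i)->textContent . ' => ' . $ratings->item($i)->textContent;
}

print_r($pairs);
```

Here the two result lists have different lengths (three years, two ratings), and the second pair joins the 1972 row to a rating that belongs to the 2008 row, with no error raised anywhere.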
In my case I am trying to extract multiple chunks of data from a site, where each chunk is nominally of the form:
<p class="main" atrb1="value1">
<a href="#1" >href text 1</a>
<span class="sub1">
<span class="span1"></span>
<span class="sub2">
<span class="span2">data span2</span>
<a href="#2">href text 2</a>
</span>
<span class="sub3">
<span class="span3">
<p>Some other data</p>
<span class="sub4">
<span class="sub5">More data</span>
</span>
</span>
</span>
</span>
</p>
For each chunk, I need to grab data from various sections:
- ".main"
- ".main a"
- ".main .span2"
- ".main .sub2 a"
- ".main .span3 p"
- etc
And then process the set of data as one distinct unit, and not as multiple collections of different data.
I know I can hard-code the selection of each element (and I currently do that), but that produces brittle code that relies on the source data being stable. This week the data source changed yet again, and I was bitten when my hard-coded scraping stopped working. Thus I am trying to write robust code that can locate what I want without me having to care or know about the overall structure (Hmmm - LINQ for PHP?)
So in my mind, I want the code to look something like
$dom = new Zend_Dom_Query($body);
$results = $dom->query('.main');
foreach ($results as $result) {
    // Wished-for API: a query() scoped to the current .main chunk
    $data1 = $result->query(".main a");
    $data2 = $result->query(".main .span2");
    $data3 = $result->query(".main .sub2 a");
    // etc.
    if ($data1 && $data2 && $data3) {
        // Do something
    } else {
        // Do something else
    }
}
Is it possible to do what I want with stock Zend/PHP function calls? Or do I need to write some sort of custom function to implement $result->query()?
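One possible direction, sketched under assumptions rather than offered as a definitive answer: PHP's stock DOMXPath::query() accepts a context node as its second argument, so each sub-query can be evaluated relative to one .main element instead of the whole document. The markup below is a simplified, well-formed stand-in for the chunk shown earlier (div instead of the nested p), and firstText() and hasClass() are hypothetical helpers introduced only for this example. With Zend_Dom_Query, the underlying DOMDocument could presumably be taken from the result object (Zend_Dom_Query_Result::getDocument(), if I recall its API correctly), but that part is untested here.

```php
<?php
// Simplified, hypothetical stand-in for the scraped markup: two chunks,
// the second of which lacks the .sub2 / .span2 section entirely.
$body = '<div class="main" atrb1="value1">'
      .   '<a href="#1">href text 1</a>'
      .   '<span class="sub2"><span class="span2">data span2</span>'
      .   '<a href="#2">href text 2</a></span>'
      . '</div>'
      . '<div class="main" atrb1="value2">'
      .   '<a href="#3">href text 3</a>'
      . '</div>';

libxml_use_internal_errors(true); // keep parser notices out of the output
$doc = new DOMDocument();
$doc->loadHTML($body);
$xpath = new DOMXPath($doc);

// Hypothetical helper: XPath predicate matching one token of the class attribute.
function hasClass($name)
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' $name ')";
}

// Hypothetical helper: text of the first match relative to $context, or null.
function firstText($xpath, $expr, $context)
{
    $nodes = $xpath->query($expr, $context);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

$chunks = array();
foreach ($xpath->query('//*[' . hasClass('main') . ']') as $main) {
    // Every expression starts with "." so it is scoped to this chunk only.
    $chunks[] = array(
        'atrb1' => $main->getAttribute('atrb1'),
        'link'  => firstText($xpath, './/a', $main),
        'span2' => firstText($xpath, './/*[' . hasClass('span2') . ']', $main),
        'sub2a' => firstText($xpath, './/*[' . hasClass('sub2') . ']//a', $main),
    );
}

print_r($chunks);
```

Because each chunk becomes its own array, a missing element shows up as null in that chunk alone and cannot shift values between chunks, which is the property the separate document-wide queries lack.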