I'm trying to extract a list of hyperlinks (specifically the href attribute of each) from a string of HTML. The contents of each page are not too far off what early versions of Yahoo looked like (a series of hyperlinks broken into groupings by LI and UL tags).
We are parsing a series of hand-crafted HTML pages from an old system and want to pull only the meaningful content from each page rather than migrating the entire string. In my testing, my process is straightforward and is as follows:
- load the contents of the HTML page into a string
- parse the contents looking for "A" elements, but only those that appear after a specific tag with a specific class assigned
- for each link found, echo the URL (for testing), and ultimately write that item to our database
I'm fairly sure that the best way to do this is with a regular expression, but I wasn't able to get any of the examples I found on Stack Overflow working correctly (even just to echo the matches found), and I haven't had much success with the DOM parser either.
My test data looks like this:
<html>
<body>
<li><a href='beforelist.com'></a></li>
<ul class="summary">
<li><a href='test.com'></a></li>
<li><a href='test2.com'></a></li>
<li><a href='etc.com'></a></li>
</ul>
<li><a href='afterlist.com'></a></li>
<img src='/test.png'>
</body>
</html>
and I'm looking for output like the following (only the links found inside the element with class='summary'):
test.com
test2.com
etc.com
Everything outside of the summary grouping should be ignored, and what it may include is very unpredictable. I'm sure I'm missing something obvious and greatly appreciate any assistance! I never really understood how to write regex patterns correctly. :)
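For illustration, here is a minimal sketch of the behaviour described above, written in Python with the standard library's html.parser rather than a regular expression. The language choice and the names SummaryLinkParser and extract_summary_links are assumptions made for this example only; it also assumes the grouping is always a UL tag carrying the class "summary", as in the test data.

```python
from html.parser import HTMLParser


class SummaryLinkParser(HTMLParser):
    """Collect href values from <a> tags inside <ul class="summary">.

    Hypothetical helper for illustration; not taken from the original system.
    """

    def __init__(self, target_class="summary"):
        super().__init__()
        self.target_class = target_class
        self.depth = 0   # open <ul> tags counted from the target list (0 = outside)
        self.links = []  # href values found inside the target list

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "ul":
            classes = (attrs.get("class") or "").split()
            if self.depth > 0:
                # A nested list inside the target list: track it so its
                # closing tag does not end collection early.
                self.depth += 1
            elif self.target_class in classes:
                self.depth = 1
        elif tag == "a" and self.depth > 0 and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "ul" and self.depth > 0:
            self.depth -= 1


def extract_summary_links(html):
    """Return the hrefs found inside the summary list, in document order."""
    parser = SummaryLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Same structure as the test data in the question.
SAMPLE_HTML = """
<html>
<body>
<li><a href='beforelist.com'></a></li>
<ul class="summary">
<li><a href='test.com'></a></li>
<li><a href='test2.com'></a></li>
<li><a href='etc.com'></a></li>
</ul>
<li><a href='afterlist.com'></a></li>
<img src='/test.png'>
</body>
</html>
"""

if __name__ == "__main__":
    for url in extract_summary_links(SAMPLE_HTML):
        print(url)
```

The sketch tracks list depth instead of matching text patterns, so links before and after the summary list are skipped regardless of what surrounds them. Writing each URL to a database would replace the print call.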