I need to parse a malformed HTML page and extract certain URLs from it as some kind of Collection. I don't really care which kind of Collection; I just need to be able to iterate over it.
Let's say we have a structure like this:
<html>
  <body>
    <div class="outer">
      <div class="inner">
        <a href="http://www.google.com" title="Google">Google-Link</a>
        <a href="http://www.useless.com" title="I don't need this">Blah blah</a>
      </div>
      <div class="inner">
        <a href="http://www.youtube.com" title="Youtube">Youtube-Link</a>
        <a href="http://www.useless2.com" title="I don't need this2">Blah blah2</a>
      </div>
    </div>
  </body>
</html>
And here is what I have so far:
// TagSoup version 1.2 is under the Apache License 2.0
@Grab(group='org.ccil.cowan.tagsoup', module='tagsoup', version='1.2')
import groovy.util.slurpersupport.GPathResult // groovy.xml.slurpersupport in Groovy 3+

// TagSoup's SAX parser lets XmlSlurper cope with the malformed HTML
XmlSlurper slurper = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser())
GPathResult nodes = slurper.parse("test.html")

// collect every node whose class attribute is "inner"
def links = nodes."**".findAll { it.@class == "inner" }
println links
I want something like
["http://google.com", "http://youtube.com"]
but all I get is:
["Google-LinkBlah blah", "Youtube-LinkBlah blah2"]
To be more precise, I can't just take all the URLs: the HTML document I need to parse is about 15,000 lines long and contains a lot of URLs that I don't need. What I actually need is the first URL in each "inner" block.
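For what it's worth, here is a rough sketch of the kind of query I imagine should do it. It's a guess, not verified code: I'm assuming the anchors of each inner div can be reached as div.a and that the attribute can be read via @href.

// find the "inner" divs, then take only the first <a> of each and read its href
def innerDivs = nodes.'**'.findAll { it.name() == 'div' && it.@class == 'inner' }
def urls = innerDivs.collect { div -> div.a[0].@href.text() }
println urls // hoping for: [http://www.google.com, http://www.youtube.com]

If there is a cleaner GPath expression (or a completely different approach that fits better), that would be fine too; any Collection of the href strings is good enough for me.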