Extract URL from href-tag in groovy

Question

I need to parse a malformed HTML-page and extract certain URLs from it as any kind of Collection. I don't really care what kind of Collection, I just need to be able to iterate over it.

Let's say we have a structure like this:

<html>
  <body>
    <div class="outer">
      <div class="inner">
        <a href="http://www.google.com" title="Google">Google-Link</a>
        <a href="http://www.useless.com" title="I don't need this">Blah blah</a>
      </div>
      <div class="inner">
        <a href="http://www.youtube.com" title="Youtube">Youtube-Link</a>
        <a href="http://www.useless2.com" title="I don't need this2">Blah blah2</a>
      </div>
    </div>
  </body>
</html>

And here is what I do so far:

// tagsoup version 1.2 is under apache license 2.0
@Grab(group='org.ccil.cowan.tagsoup', module='tagsoup', version='1.2' )
XmlSlurper slurper = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser());

GPathResult nodes = slurper.parse("test.html"); 
def links = nodes."**".findAll { it.@class == "inner" }
println links

I want something like

["http://www.google.com", "http://www.youtube.com"]

but all I get is:

["Google-LinkBlah blah", "Youtube-LinkBlah blah2"]

To be more precise: I can't use all URLs, because the HTML document that I need to parse is about 15,000 lines long and has a lot of URLs that I don't need. So I need only the first URL in each "inner" block.

tim_yates · Accepted Answer · 2013-03-18T10:53:30.190

5

As The Trav says, you need to grab the href attribute from each matching a tag.

You've edited your question so the class bit in the findAll makes no sense, but with the current HTML example, this should work:

def links = nodes.'**'.findAll { it.name() == 'a' }*.@href*.text()
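
For anyone who wants to try this without a test.html file, here is a minimal sketch of the same idea. It assumes Groovy 3 or later (hence the import) and parses the question's sample from a string with a plain XmlSlurper, which only works because the sample happens to be well-formed; for a truly malformed page, keep the TagSoup-backed slurper from the question. The names html and allLinks are made up for illustration, and the spread is written out with collect.

```groovy
import groovy.xml.XmlSlurper // Groovy 3+; in Groovy 2.x, groovy.util.XmlSlurper is imported by default

// Copy of the sample from the question (well-formed, so no TagSoup is needed here)
def html = '''<html>
  <body>
    <div class="outer">
      <div class="inner">
        <a href="http://www.google.com" title="Google">Google-Link</a>
        <a href="http://www.useless.com" title="I don't need this">Blah blah</a>
      </div>
      <div class="inner">
        <a href="http://www.youtube.com" title="Youtube">Youtube-Link</a>
        <a href="http://www.useless2.com" title="I don't need this2">Blah blah2</a>
      </div>
    </div>
  </body>
</html>'''

def nodes = new XmlSlurper().parseText(html)

// Depth-first walk over every node, keep the <a> elements,
// then turn each href attribute into a plain String
def allLinks = nodes.'**'
        .findAll { it.name() == 'a' }
        .collect { it.@href.text() }

println allLinks
```

This returns all four URLs in document order, including the ones the question wants to skip.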

Edit

If (as you say after the edit) you just want the first a inside anything marked with class="inner", then try:

def links = nodes.'**'.findAll { it.@class?.text() == 'inner' }
                 .collect { d -> d.'**'.find { it.name() == 'a' }?.@href }
                 .findAll() // remove nulls if there are any
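
A self-contained sketch of the same approach, in case it helps to run it in isolation. Assumptions: Groovy 3 or later, and the question's sample parsed from a string with a plain XmlSlurper (fine here because the sample is well-formed; a malformed page would still need the TagSoup parser from the question). The names html and firstLinks are illustrative, and .text() is added so the result is a list of Strings rather than attribute nodes.

```groovy
import groovy.xml.XmlSlurper // Groovy 3+; in Groovy 2.x, groovy.util.XmlSlurper is imported by default

// Copy of the sample from the question
def html = '''<html>
  <body>
    <div class="outer">
      <div class="inner">
        <a href="http://www.google.com" title="Google">Google-Link</a>
        <a href="http://www.useless.com" title="I don't need this">Blah blah</a>
      </div>
      <div class="inner">
        <a href="http://www.youtube.com" title="Youtube">Youtube-Link</a>
        <a href="http://www.useless2.com" title="I don't need this2">Blah blah2</a>
      </div>
    </div>
  </body>
</html>'''

def nodes = new XmlSlurper().parseText(html)

def firstLinks = nodes.'**'
        // 1. every element whose class attribute is exactly "inner"
        .findAll { it.@class.text() == 'inner' }
        // 2. within each block, the first <a> in depth-first order (null if there is none)
        .collect { div -> div.'**'.find { it.name() == 'a' }?.@href?.text() }
        // 3. drop nulls (and empty hrefs) left by blocks without a usable link
        .findAll()

println firstLinks
```

Because find stops at the first match, the second link in each "inner" block is never collected.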

edited Mar 18 '13 at 10:53

answered Mar 18 '13 at 09:43

tim_yates


Hi. Thank you for answering my question. As you stated I've edited my question to be more accurate. And I've done it again. I am sorry for that^^ – Jakunar Mar 18 '13 at 10:43
interesting use of findAll() to remove nulls, I haven't seen it used like that before – The Trav Apr 22 '13 at 07:05
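
On the findAll() point raised in the comment above: called without a closure, findAll() keeps only the elements that are true under Groovy truth, so it removes nulls but also empty strings and zeros. A small sketch with made-up values (values, truthy, nonNull, and mapped are illustrative names only):

```groovy
def values = ['http://www.google.com', null, '', 'http://www.youtube.com']

// No-argument findAll() filters by Groovy truth: null and '' are both dropped
def truthy = values.findAll()

// An explicit null check keeps the empty string
def nonNull = values.findAll { it != null }

// findResults transforms each element and skips null results in one pass
def mapped = values.findResults { it ? it.toUpperCase() : null }

println truthy
println nonNull
println mapped
```

If empty values must survive, the explicit null check is the safer choice; findResults is an alternative to a collect followed by findAll().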

score 0 · Answer 2 · answered Mar 18 '13 at 02:18


You're looking for the @href attribute on each of your nodes.

answered Mar 18 '13 at 02:18

The Trav


Hey, thanks for your reply! I appreciate it! I am sorry I wasn't very precise in the beginning, so I updated my question. Again, thanks for taking the time. :) – Jakunar Mar 18 '13 at 10:34
