How to extract specific text from a webpage?

Question

I'm trying to extract specific text from a webpage.

This is the part of the webpage which contains the specific text:

<div class="module">
<div class="body">
<dl class="per_info">
<dt>F.Name:</dt>
<dd><a class="nm" href="http://">a Variable Name1</a></dd>
<dt>L.Name:</dt>
<dd><a class="nm" href="http://">a Variable Name2</a></dd>
</dl>
</div>
</div>

How can I extract the contents of Variable Name1 and Variable Name2?

Is there an HTML parser that can do this extraction?

+1: Finally there's someone who asks for a *parser* to parse HTML instead of asking for regular expressions. — Roland Illig, Sep 18 '11 at 18:49

score 0 · Answer 1 · answered Mar 12 '13 at 12:53

jsoup is a Java library that can parse HTML and extract element data. To use jsoup, first create a jsoup Document by parsing a file, URL, whole document string, or HTML fragment string. An HTML fragment looks something like this:

String html = "<div class='module'>" +
    "<div class='body'>" +
    "<dl class='per_info'>" +
    "<dt>F.Name:</dt>" +
    "<dd><a class='nm' href='http://'>a Variable Name1</a></dd>" +
    "<dt>L.Name:</dt>" +
    "<dd><a class='nm' href='http://'>a Variable Name2</a></dd>" +
    "</dl>" +
    "</div>" +
    "</div>";
Document doc = Jsoup.parseBodyFragment(html);

With the document, you can use jsoup's selectors to locate specific elements:

// select all <a> elements from the document
Elements anchors = doc.select("a");

With the element collection, you can iterate over the elements and extract their text content:

for (Element anchor : anchors) {
    String contents = anchor.text();
    System.out.println(contents);
}
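
Selecting every <a> would also pick up unrelated links on a real page. As a rough sketch (not part of the original answer), a narrower CSS selector could target only the anchors inside the per_info list. The class name PerInfoNames and the helper extractNames are made up for illustration, and Elements.eachText() assumes a reasonably recent jsoup (1.10.2 or later, as far as I know):

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class PerInfoNames {
    // Sample fragment copied from the question.
    static final String HTML =
        "<div class='module'><div class='body'><dl class='per_info'>"
        + "<dt>F.Name:</dt><dd><a class='nm' href='http://'>a Variable Name1</a></dd>"
        + "<dt>L.Name:</dt><dd><a class='nm' href='http://'>a Variable Name2</a></dd>"
        + "</dl></div></div>";

    // Hypothetical helper: returns the text of each <a class="nm"> that is a
    // direct child of a <dd> inside <dl class="per_info">.
    static List<String> extractNames(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        return doc.select("dl.per_info dd > a.nm").eachText();
    }

    public static void main(String[] args) {
        // Prints one extracted name per line.
        extractNames(HTML).forEach(System.out::println);
    }
}
```

For a live page, Jsoup.connect(url).get() could replace parseBodyFragment; the selector stays the same.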

score 0 · Answer 2 · answered Sep 18 '11 at 18:40

Well, you can try Selenium. It drives a browser that loads the HTML page and exposes it to your Java code in a DOM-aware fashion, so afterwards you can pick out the content of HTML elements by id, XPath, etc.

http://seleniumhq.org/
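
To give an idea of what XPath-based selection looks like, here is a minimal sketch that does not use Selenium at all: it swaps in the JDK's built-in javax.xml.xpath API. That only works here because the fragment in the question happens to be well-formed XML; real-world HTML usually is not. The class name XPathNames and the helper extractNames are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathNames {
    // Sample fragment copied from the question (well-formed, so an XML parser accepts it).
    static final String HTML =
        "<div class='module'><div class='body'><dl class='per_info'>"
        + "<dt>F.Name:</dt><dd><a class='nm' href='http://'>a Variable Name1</a></dd>"
        + "<dt>L.Name:</dt><dd><a class='nm' href='http://'>a Variable Name2</a></dd>"
        + "</dl></div></div>";

    // Hypothetical helper: evaluates an XPath expression against the markup
    // and returns the text of every matching <a class="nm"> element.
    static List<String> extractNames(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            NodeList nodes = (NodeList) xpath.evaluate(
                "//dl[@class='per_info']/dd/a[@class='nm']",
                new InputSource(new StringReader(xml)),
                XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                names.add(nodes.item(i).getTextContent());
            }
            return names;
        } catch (XPathExpressionException e) {
            // Raised for malformed input as well as for a bad expression.
            throw new IllegalArgumentException("could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        extractNames(HTML).forEach(System.out::println);
    }
}
```

The same XPath expression should carry over to a tool like Selenium, where the browser builds the DOM and tolerates messy markup.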

Shivan Dragon

score 0 · Answer 3 · answered Sep 18 '11 at 18:43

TagSoup is a SAX-compliant parser that can parse HTML as found in the "wild", so the input doesn't need to be well-formed XML.
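
Because TagSoup speaks SAX, the extraction logic lives in an ordinary ContentHandler. The sketch below shows such a handler, but runs it with the JDK's own SAX parser rather than TagSoup so that it has no external dependency; that is only possible because the sample fragment is well-formed. The names SaxNames, NameHandler and extractNames are made up for illustration. With TagSoup on the classpath, the XMLReader would presumably be TagSoup's parser class instead, with the handler largely unchanged.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class SaxNames {
    // Sample fragment copied from the question.
    static final String HTML =
        "<div class='module'><div class='body'><dl class='per_info'>"
        + "<dt>F.Name:</dt><dd><a class='nm' href='http://'>a Variable Name1</a></dd>"
        + "<dt>L.Name:</dt><dd><a class='nm' href='http://'>a Variable Name2</a></dd>"
        + "</dl></div></div>";

    // Collects the text of each <a class="nm"> element as SAX events arrive.
    static class NameHandler extends DefaultHandler {
        final List<String> names = new ArrayList<>();
        private StringBuilder current; // non-null only while inside a matching <a>

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("a".equalsIgnoreCase(qName) && "nm".equals(atts.getValue("class"))) {
                current = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text may arrive in several chunks, so append rather than assign.
            if (current != null) {
                current.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (current != null && "a".equalsIgnoreCase(qName)) {
                names.add(current.toString().trim());
                current = null;
            }
        }
    }

    // Hypothetical helper: parses the markup with the JDK SAX parser
    // (an assumption made for this sketch, not TagSoup itself).
    static List<String> extractNames(String xml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            NameHandler handler = new NameHandler();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.names;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        extractNames(HTML).forEach(System.out::println);
    }
}
```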

Christopher
