
I'm trying to extract specific text from a webpage.

This is the part of the webpage which contains the specific text:

<div class="module">
<div class="body">
<dl class="per_info">
<dt>F.Name:</dt>
<dd><a class="nm" href="http://">a Variable Name1</a></dd>
<dt>L.Name:</dt>
<dd><a class="nm" href="http://">a Variable Name2</a></dd>
</dl>
</div>
</div>

How can I extract the contents of Variable Name1 and Variable Name2?

Is there an HTML parser that can do this extraction?

mwdar

3 Answers


jsoup is a Java library that can parse HTML and extract element data. To use jsoup, you first create a jsoup Document by parsing it from a file, a URL, a whole-document string, or an HTML fragment string. An HTML fragment looks something like this:

String html = "<div class='module'>" +
    "<div class='body'>" +
    "<dl class='per_info'>" +
    "<dt>F.Name:</dt>" +
    "<dd><a class='nm' href='http://'>a Variable Name1</a></dd>" +
    "<dt>L.Name:</dt>" +
    "<dd><a class='nm' href='http://'>a Variable Name2</a></dd>" +
    "</dl>" +
    "</div>" +
    "</div>";
Document doc = Jsoup.parseBodyFragment(html);

With the document, you can use jsoup's selectors to locate specific elements:

// select all <a/> elements from the document
Elements anchors = doc.select("a");

With the element collection, you can iterate over the elements and extract their text contents:

for (Element anchor : anchors) {
    String contents = anchor.text();
    System.out.println(contents);
}

Brent Worden

You can try Selenium. It loads the HTML page into your Java code in a DOM-aware fashion, so that afterwards you can pick out the content of HTML elements by id, XPath, etc.

http://seleniumhq.org/
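
To make the XPath idea concrete, here is a minimal sketch that deliberately uses the JDK's built-in javax.xml.xpath instead of Selenium, so it runs without a browser or driver. It only works because the fragment from the question happens to be well-formed XML. The class name XPathNameExtractor and the expression in NAME_XPATH are illustrative assumptions, not part of any library. With Selenium, a similar expression could presumably be passed to By.xpath(...) and the text read from each matching element with getText().

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathNameExtractor {
    // Assumed expression: anchors with class "nm" inside the "per_info" list.
    static final String NAME_XPATH = "//dl[@class='per_info']/dd/a[@class='nm']";

    // Returns the text of every matching anchor, in document order.
    // The input must be well-formed XML; real-world HTML usually is not.
    static List<String> extractNames(String xhtml) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                NAME_XPATH,
                new InputSource(new StringReader(xhtml)),
                XPathConstants.NODESET);
        List<String> names = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            names.add(nodes.item(i).getTextContent().trim());
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<div class=\"module\">"
                + "<div class=\"body\">"
                + "<dl class=\"per_info\">"
                + "<dt>F.Name:</dt>"
                + "<dd><a class=\"nm\" href=\"http://\">a Variable Name1</a></dd>"
                + "<dt>L.Name:</dt>"
                + "<dd><a class=\"nm\" href=\"http://\">a Variable Name2</a></dd>"
                + "</dl>"
                + "</div>"
                + "</div>";
        for (String name : extractNames(xhtml)) {
            System.out.println(name);
        }
    }
}
```

Restricting the expression to the per_info list keeps unrelated links elsewhere on the page out of the result.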

Shivan Dragon

TagSoup is a SAX-compliant parser that can parse HTML as it is found in the "wild", so the input does not need to be well-formed XML.
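
Because the parser is SAX-based, the extraction itself lives in a content handler. Below is a rough sketch of such a handler. NameHandler and extractNames are hypothetical names chosen for illustration. To stay runnable without extra jars, main() drives the handler with the JDK's own SAX reader, which only works because the question's fragment is well-formed. With TagSoup on the classpath, an instance of org.ccil.cowan.tagsoup.Parser should be usable as the XMLReader instead; that substitution is an assumption and has not been tested here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class NameHandler extends DefaultHandler {
    private final List<String> names = new ArrayList<>();
    // Non-null only while the parser is inside an <a class="nm"> element.
    private StringBuilder current;

    // Namespace-aware readers fill localName; others leave it empty and use qName.
    private static String tagName(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("a".equalsIgnoreCase(tagName(localName, qName))
                && "nm".equals(atts.getValue("class"))) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may arrive in several chunks, so accumulate it.
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current != null && "a".equalsIgnoreCase(tagName(localName, qName))) {
            names.add(current.toString().trim());
            current = null;
        }
    }

    public List<String> getNames() {
        return names;
    }

    // Parses the markup with the given reader and returns the collected names.
    static List<String> extractNames(XMLReader reader, String markup) throws Exception {
        NameHandler handler = new NameHandler();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return handler.getNames();
    }

    public static void main(String[] args) throws Exception {
        // JDK reader used as a stand-in; see the note above about TagSoup.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        String markup = "<dl class=\"per_info\">"
                + "<dt>F.Name:</dt>"
                + "<dd><a class=\"nm\" href=\"http://\">a Variable Name1</a></dd>"
                + "<dt>L.Name:</dt>"
                + "<dd><a class=\"nm\" href=\"http://\">a Variable Name2</a></dd>"
                + "</dl>";
        for (String name : extractNames(reader, markup)) {
            System.out.println(name);
        }
    }
}
```

Keeping the handler independent of the concrete reader means the same extraction logic can be reused whichever SAX parser ends up supplying the events.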

Christopher