regex question for parsing xml

Question

I'm trying to get the text in between the tags <div>Text Here</div>:

<div id="tt" class="info">
  Text Here
</div>

Output: Text Here

How can I achieve this using regex in Java? Thanks.

EDIT:

I'm using HtmlUnit:

 currentPage.getElementById("tt").asXml();

 currentPage.getElementById("tt").asText(); // returns ""

[You shouldn't try to parse HTML with RegEx](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) — Bohemian, Aug 10 '11 at 01:10

score 5 · Answer 1 · answered Aug 10 '11 at 01:08


Don't. It's much easier to use a proper parser and just pull out the elements you are interested in. It's extremely difficult to use regular expressions for this kind of thing.
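As a rough illustration of the parser route (not necessarily what the answerer had in mind), here is a minimal sketch that uses only the XML parser and XPath engine bundled with the JDK. It assumes the input is well-formed XML; real-world HTML usually is not, and would need an HTML-aware parser instead. The class and method names (`DivTextExtractor`, `extractText`) are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DivTextExtractor {
    // normalize-space() trims the ends and collapses internal runs of whitespace
    private static final String EXPR = "normalize-space(//div[@id='tt'])";

    /** Returns the text of the div with id "tt"; assumes well-formed XML input. */
    static String extractText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return XPathFactory.newInstance().newXPath().evaluate(EXPR, doc);
        } catch (Exception e) {
            // Hypothetical error handling: surface any parse failure as unchecked
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<div id=\"tt\" class=\"info\">\n  Text Here\n</div>";
        System.out.println(extractText(xml)); // prints "Text Here"
    }
}
```

The parser deals with the tag structure and the XPath expression only says which element is wanted, so nothing here depends on attribute order or on how the markup is indented.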

Cameron Skinner

I have used `asText()` from `HtmlUnit`, but it returns an empty string `""` – Eng.Fouad Aug 10 '11 at 01:15

score 2 · Answer 2 · answered Aug 10 '11 at 10:38

NEVER try to parse XML or HTML using regular expressions.

It's theoretically impossible in the general case: XML and HTML allow arbitrarily deep nesting, which puts their grammar in a richer class than the regular languages that regular expressions can recognize.
You'll get it wrong anyway, for reasons that have nothing to do with the theoretical limitations: there are too many subtleties, such as whitespace, CDATA sections, and comments, that you need to take into account.
There's no shortage of free off-the-shelf parsers that do the job properly, and fast.
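To make the "subtleties" point concrete, here is a small sketch (my own illustration, not from the answer) that feeds the same well-formed snippet to a naive pattern and to the JDK's built-in XML parser. The pattern, the sample input, and the names `SubtletiesDemo`, `regexFinds`, and `parserText` are all assumptions chosen for the demonstration.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class SubtletiesDemo {
    // A naive pattern that expects nothing but plain text between the tags
    static final Pattern NAIVE =
            Pattern.compile("<div[^>]*>\\s*([^<]+?)\\s*</div>");

    /** True if the naive pattern finds a div containing only plain text. */
    static boolean regexFinds(String xml) {
        return NAIVE.matcher(xml).find();
    }

    /** Text content of the root element, as seen by a real XML parser. */
    static String parserText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // getTextContent() leaves out comments and includes CDATA content
            return doc.getDocumentElement().getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Same logical content as "Text Here", but with a comment and a CDATA section
        String xml = "<div id=\"tt\"><!-- label --><![CDATA[Text Here]]></div>";
        System.out.println(regexFinds(xml)); // prints "false"
        System.out.println(parserText(xml)); // prints "Text Here"
    }
}
```

The pattern could be patched to cope with comments and CDATA, but each patch makes it longer and it still would not handle the next variation, whereas the parser handles these cases already.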

score 1 · Answer 3 · answered Aug 10 '11 at 01:11


Also, if it is HTML you are trying to parse, try Jsoup: http://watchitlater.com/blog/2010/09/jsoup-beautifulsoup-for-java/

Skylude

score 1 · Answer 4 · answered Aug 10 '11 at 01:16

You can use a regex for this, so long as you don't mind doing exactly what you said (and probably not what you meant):

Try the regexp <div[^>]*>(.*)</div>, with DOTALL enabled so that . also matches line breaks, on the string:

<div id="tt" class="info">
    <a href="../link.htm">Clicky</a>
</div>

You'll get the value <a href="../link.htm">Clicky</a> (plus the surrounding whitespace), instead of what you want, which is Clicky. Since XML can nest stuff without limit, regular expressions can't match nested elements unless you make certain sacrifices (like handcoding for each level that you want to accommodate).
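For anyone who wants to try this, here is a runnable sketch of the experiment. It assumes a variant of the pattern that uses `[^>]*` for the opening tag and `Pattern.DOTALL` so that `.` can cross line breaks; the second input (nested divs) and the names `NestedDivDemo` and `firstGroup` are my own additions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedDivDemo {
    // Greedy: the group runs up to the LAST </div> in the input
    static final Pattern GREEDY =
            Pattern.compile("<div[^>]*>(.*)</div>", Pattern.DOTALL);
    // Reluctant: the group stops at the FIRST </div> in the input
    static final Pattern RELUCTANT =
            Pattern.compile("<div[^>]*>(.*?)</div>", Pattern.DOTALL);

    /** Returns group 1 of the first match, or null if there is no match. */
    static String firstGroup(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String withLink = "<div id=\"tt\" class=\"info\">\n"
                + "    <a href=\"../link.htm\">Clicky</a>\n"
                + "</div>";
        // The captured group is the inner markup, not the visible text
        System.out.println(firstGroup(GREEDY, withLink).trim());
        // prints: <a href="../link.htm">Clicky</a>

        String nested = "<div>outer <div>inner</div> tail</div>";
        // The reluctant version pairs the outer <div> with the inner </div>
        System.out.println(firstGroup(RELUCTANT, nested));
        // prints: outer <div>inner
    }
}
```

Neither quantifier can pair each opening tag with its own closing tag once elements nest, which is the sacrifice described above.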

score 1 · Accepted Answer · answered Aug 10 '11 at 01:24

With regular expressions, you can use the following:

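import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "<div id=\"tt\" class=\"info\">\n  Text Here   \n</div>";
System.out.println(s);
// Match the exact opening tag, then capture the text without its surrounding whitespace.
// DOTALL is harmless here: the pattern contains no '.'.
Pattern p = Pattern.compile("<div id=\"tt\" class=\"info\">\\s*([^<]+?)\\s*</div>", Pattern.DOTALL);
Matcher m = p.matcher(s);
if (m.find()) {
    System.out.println(m.group(1));  // Text Here
}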

However, a better solution would be to parse the HTML into XHTML, using JTidy, for example, and then extract the required text using XPath (//div[@id = 'tt']/text()). Something along these lines:

import java.net.URL;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;

public static void main(String[] args) throws Exception {
    // Create a new JTidy instance and set options
    Tidy tidy = new Tidy();
    tidy.setXHTML(true);

    // Parse an HTML page into a DOM document
    URL url = new URL("http://something.com/something.html");
    Document doc = tidy.parseDOM(url.openStream(), System.out);

    // Use XPath to obtain whatever you want from the (X)HTML
    XPath xpath = XPathFactory.newInstance().newXPath();
    XPathExpression expr = xpath.compile("//div[@id = 'tt']/text()");
    String text = (String) expr.evaluate(doc, XPathConstants.STRING);
    System.out.println(text); // Text Here
}
