How to build a selector for HTML DOM elements by class name using regexp

Question

Suppose I have the following HTML file:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
 <head>
  <title> New Document </title>
  <meta name="Generator" content="EditPlus">
  <meta name="Author" content="">
  <meta name="Keywords" content="">
  <meta name="Description" content="">
 </head>

<body>
<h1>Welcome to My Homepage</h1>
<p class="intro">My name is Donald.</p>
<h1 class="intro"><p class="important">Note that this is an important paragraph.</p>
</h1>
<div class="intro important"><p class="apple">I live in apple.</p></div>
<div class="intro important">I like apple.</p></div>
<p>I live in Duckburg.</p>
 </body>
</html>

I want to get HTML elements by class name. If the class name is ".intro", it should return:

My name is Donald.
<p class="important">Note that this is an important paragraph.</p>

If the class name is ".intro.important" it should return:

Note that this is an important paragraph.

If the class name is ".intro.important>.apple", it should return:

I live in apple.

I know jQuery has a class selector that does this, but I want to implement it myself. Can I use Java regexp to do this? A single class name seems manageable, but a selector that also contains a child class name makes it hard. One more question: can Java get the DOM structure of the HTML?

score 1 · Answer 1 · edited May 23 '17 at 11:49

You can't parse [x]HTML with RegEx

It's that simple: RegExp was not built to cover the full grammar of XML, and different jobs need different tools.

CSS Selectors not readily available

Unfortunately, CSS selector parsers are not yet (afaik) part of DOM parsers, so you would need to use an XPath parser to achieve the same things as with CSS selectors.

There are, however, some projects such as jquery4j.org that port jQuery (+ widgets) to Java. They don't bring only CSS selectors to the table; they bring a lot more, and I'm not sure you really need all that.

XPath Selectors as an alternative to CSS Selectors

A DOM parser plus an XPath parser for Java is the best approach. The DOM parser reads and loads the HTML structure as DOM objects, while the XPath parser uses its own, different type of selectors to find objects within the DOM.

But be careful: don't feed the DOM parser huge amounts of HTML code (entire pages) unless you really need it to sift through it all. If you have a smaller string that isolates the targeted area of the HTML where your info is present, it's better to use the DOM parser on that, because DOM parsers are memory-hungry beasts.
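As a rough sketch of the DOM + XPath approach, the example below uses only the parsers that ship with the JDK (javax.xml.parsers and javax.xml.xpath). The class name ClassXPathDemo and the helpers hasClass and select are made up for illustration. The JDK parser accepts only well-formed XML, so the input is assumed to be a small, cleaned-up fragment of the page; the HTML in the question, with its unclosed meta tags and stray closing tag, would be rejected as it stands.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ClassXPathDemo {

    // XPath predicate roughly equivalent to the CSS class selector ".name".
    // Padding with spaces keeps "intro" from matching "introduction".
    public static String hasClass(String name) {
        return "contains(concat(' ', normalize-space(@class), ' '), ' " + name + " ')";
    }

    // Parses a well-formed fragment and returns the trimmed text content
    // of every node selected by the XPath expression, in document order.
    public static List<String> select(String xhtml, String xpathExpr) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(xpathExpr, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            // Assumption: callers pass well-formed markup; anything else ends up here.
            throw new IllegalStateException("Could not parse or query the fragment", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<body>"
            + "<p class=\"intro\">My name is Donald.</p>"
            + "<div class=\"intro important\"><p class=\"apple\">I live in apple.</p></div>"
            + "<p>I live in Duckburg.</p>"
            + "</body>";

        // CSS ".intro"
        System.out.println(select(xhtml, "//*[" + hasClass("intro") + "]"));

        // CSS ".intro.important>.apple"
        String parent = "//*[" + hasClass("intro") + " and " + hasClass("important") + "]";
        System.out.println(select(xhtml, parent + "/*[" + hasClass("apple") + "]"));
    }
}
```

The child combinator ">" maps to a single "/" step in XPath, and a compound selector such as ".intro.important" maps to two predicates joined with "and".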

score 0 · Accepted Answer · answered Jun 27 '14 at 16:15

Can I use java regexp to do this?

You can create a regex that selects the nested content within a tag that has a specific class name. I can give you a regex that finds the content within a tag, but it does not take the class name into account:

<([a-z][a-z0-9]*+)[^>]*>.*?</\\1>
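One way the pattern above might be extended to take the class name into account is sketched below. It is not the answerer's code: the class RegexClassDemo, the method innerHtmlByClass and the modified pattern are assumptions made for illustration. The sketch assumes double-quoted class attributes and does not handle an element nested inside another element with the same tag name (for example a div inside a div), which is the kind of case where a regex stops being adequate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexClassDemo {

    // group 1 = tag name, group 2 = value of the class attribute, group 3 = inner content.
    // Assumes class="..." uses double quotes.
    private static final Pattern TAG_WITH_CLASS = Pattern.compile(
        "<([a-z][a-z0-9]*)\\b[^>]*?\\sclass=\"([^\"]*)\"[^>]*>(.*?)</\\1>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the inner HTML of every element whose class list contains className.
    public static List<String> innerHtmlByClass(String html, String className) {
        List<String> result = new ArrayList<>();
        Matcher m = TAG_WITH_CLASS.matcher(html);
        int pos = 0;
        while (m.find(pos)) {
            // Compare whole class names, so "intro" does not match "introduction".
            List<String> classes = Arrays.asList(m.group(2).trim().split("\\s+"));
            if (classes.contains(className)) {
                result.add(m.group(3));
            }
            // Resume just after the opening "<" so that elements nested
            // inside this match are examined as well.
            pos = m.start() + 1;
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<p class=\"intro\">My name is Donald.</p>\n"
            + "<h1 class=\"intro\"><p class=\"important\">Note that this is an important paragraph.</p>\n"
            + "</h1>";
        for (String inner : innerHtmlByClass(html, "intro")) {
            System.out.println(inner.trim());
        }
    }
}
```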

But if the class name has a child class name, it will make it hard.

In that case it is easier to use Java string operations.
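One possible reading of this advice is to break the selector itself apart with plain string operations before doing any matching. The sketch below is an assumption about what that could look like, not the answerer's code; SelectorSplitDemo and parse are hypothetical names. It only understands class names and the ">" child combinator.

```java
import java.util.ArrayList;
import java.util.List;

public class SelectorSplitDemo {

    // Splits a selector such as ".intro.important>.apple" into one list of
    // class names per step: [[intro, important], [apple]].
    public static List<List<String>> parse(String selector) {
        List<List<String>> steps = new ArrayList<>();
        for (String step : selector.split(">")) {
            List<String> classes = new ArrayList<>();
            // The text before the first "." is empty and gets skipped.
            for (String name : step.trim().split("\\.")) {
                if (!name.isEmpty()) {
                    classes.add(name);
                }
            }
            steps.add(classes);
        }
        return steps;
    }

    public static void main(String[] args) {
        System.out.println(parse(".intro.important>.apple"));
    }
}
```

Each inner list names the classes one element must carry, and consecutive lists describe a parent and its direct child, so the result can drive whichever matching strategy is chosen for the document itself.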

can java get the dom structure of the html?

Yes, it can be done with jsoup at jsoup.org.
