3

I need to parse HTML and find the corresponding CSS styles. I can parse HTML and CSS separately, but I can't combine the results. For example, I have an XHTML page like this:

<html>
<head>
<title></title>
</head>
<body>
<div class="abc">Hello World</div>
</body>
</html>

I have to search for "Hello World", find the class name of the element that contains it, and then look up the styles for that class in an external CSS file. Answers using Java, JavaScript, or PHP are all okay.

hopper
atknatk
  • you could loop over all elements and check styles. This sounds like a very difficult task, since styles can overlap. Can you elaborate on your goal? Do you just need styles applied to text? – nycynik Nov 28 '12 at 21:28

4 Answers

3

Use jsoup, an HTML parser library for Java.
For example, you can do something like this:

// Requires org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element, java.util.Set
String html = "<<your html content>>";
Document doc = Jsoup.parse(html);
// First element whose own text contains "Hello World" (null if none is found)
Element ele = doc.getElementsContainingOwnText("Hello World").first();
// Class names of that element, e.g. [abc]
Set<String> classNames = ele.classNames();

You can explore the library further to fit your needs.

Narendra Rajput
0

Using Java's java.util.regex:

// Requires java.util.regex.Matcher and java.util.regex.Pattern
String s = "<body>...<div class=\"abc\">Hello World</div></body>";
Pattern p = Pattern.compile("<div.+?class\\s*?=\\s*['\"]?([^ '\"]+).*?>Hello World</div>", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(s);
if (m.find()) {
    // group(1) is the captured class attribute value
    System.out.println(m.group(1));
}

prints abc
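
To see how the pattern behaves on markup that differs from the sample, here is a small probe that runs the same regex over a few hand-written inputs. The class name `RegexClassProbe`, the helper `firstClass`, and the sample strings are made up for illustration; only the pattern comes from the snippet above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexClassProbe {
    // Same pattern as in the snippet above.
    private static final Pattern P = Pattern.compile(
            "<div.+?class\\s*?=\\s*['\"]?([^ '\"]+).*?>Hello World</div>",
            Pattern.CASE_INSENSITIVE);

    /** Returns the captured class value, or null when the pattern does not match. */
    public static String firstClass(String html) {
        Matcher m = P.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "<div class=\"abc\">Hello World</div>",
            "<div class='abc'>Hello World</div>",
            "<div class=abc>Hello World</div>",
            // "class=" appears inside another attribute's value
            "<div title=\"class=broken\" class=\"abc\">Hello World</div>",
            // the class belongs to a different div than the text
            "<div class=\"first\">Other</div><div>Hello World</div>"
        };
        for (String s : samples) {
            System.out.println(firstClass(s) + "  <-  " + s);
        }
    }
}
```

The first three inputs yield abc. The fourth yields broken and the fifth yields first, so the pattern is only dependable when the markup is known to look exactly like the sample.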

Evgeniy Dorofeev
  • Is the HTML really going to be that constant? If so, one would probably do as well just *looking at the source* and finding the info. :P If not, your regex will cause trouble. `
    Hello World
    ` would match and capture "broken", for example.
    – cHao Nov 28 '12 at 22:08
  • [Right, parse HTML with regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – GriffeyDog Nov 28 '12 at 22:13
  • 1
    Never try to parse HTML or XML using regex; see http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags –  Nov 28 '12 at 22:21
  • @OzHan in general I agree, but the question was about only finding "hello world" div class. Do you know a situation when regex cannot do that? – Evgeniy Dorofeev Nov 29 '12 at 05:25
  • @EvgeniyDorofeev: Any situation where you can't guarantee the HTML will always be written a certain way, is going to give a regex hell. Even something as simple as using single quotes instead of double, or not using quotes at all, would cause trouble. Both are valid in HTML, but would break a naive regex. And by the time you've built a regex capable of handling all the possible variations, you're at a point where simply using an existing, tested HTML parser would have saved you a lot of hair tearing. – cHao Nov 29 '12 at 07:00
  • Corrected my regex, now it's really hard to break – Evgeniy Dorofeev Nov 29 '12 at 09:20
0

Similar question: "Can jQuery get all CSS styles associated with an element?". Maybe CSS optimizers can do what you want; take a look at unused-css.com. It is an online tool, but it also lists other tools.

Community
dafyk
0

As I understand it, you are able to parse the style sheet from an external file, which makes the task easier to solve. First, parse the HTML file with jsoup, which supports a jQuery-like selector syntax that makes complicated HTML files easier to work with. Then check this previous solution for parsing the CSS file. I'm not going to give a full solution: these libraries do the heavy lifting internally, and the only thing you need to write is the glue code that combines the two.
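
As a rough idea of what that glue code might look like, here is a minimal sketch that takes a class name (assumed to have been found already by the HTML parser) and collects the declarations of every rule whose selector is exactly that class. The names `ClassStyleLookup` and `declarationsForClass` are hypothetical, and the regex-based CSS handling is a stand-in for a real CSS parser: it only understands flat rules, and it ignores compound or descendant selectors, `@media` blocks, specificity, and inline styles.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ClassStyleLookup {
    // Matches one flat rule of the form "selectors { declarations }".
    // Nested blocks such as @media are NOT handled (assumption of this sketch).
    private static final Pattern RULE = Pattern.compile("([^{}]+)\\{([^{}]*)\\}");

    /** Collects declarations from rules whose selector is exactly ".className". */
    public static Map<String, String> declarationsForClass(String css, String className) {
        Map<String, String> result = new LinkedHashMap<>();
        // Drop /* ... */ comments before matching rules.
        String cleaned = css.replaceAll("(?s)/\\*.*?\\*/", "");
        Matcher m = RULE.matcher(cleaned);
        while (m.find()) {
            // A rule may list several comma-separated selectors.
            for (String selector : m.group(1).split(",")) {
                if (!selector.trim().equals("." + className)) {
                    continue;
                }
                for (String decl : m.group(2).split(";")) {
                    int colon = decl.indexOf(':');
                    if (colon < 0) {
                        continue; // empty or malformed declaration
                    }
                    // Later rules overwrite earlier values for the same property.
                    result.put(decl.substring(0, colon).trim(),
                               decl.substring(colon + 1).trim());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String css = ".abc { color: red; font-size: 12px } p, .abc { margin: 0 }";
        // "abc" stands in for a class name obtained from the parsed HTML.
        System.out.println(declarationsForClass(css, "abc"));
        // prints {color=red, font-size=12px, margin=0}
    }
}
```

With jsoup on the HTML side, the class names returned by `classNames()` could be passed to this method one at a time. Anything beyond simple class selectors would need a proper CSS parser and selector matching.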

Community