Regex - how to find HTML tag content by its class?

Question

I need to get the content of an <a> HTML tag by a certain CSS class name. The CSS class that I need to find is: whtbigheader

What I have done so far is this:

    content = "<A HREF='/articles/0,7340,L-4664450,00.html' CLASS='whtbigheader' style='color:#FFFFFF;' HM=1>need to get this value</A>";

    Pattern p = Pattern.compile("<A.+?class\\s*?=[whtbigheader]['\"]?([^ '\"]+).*?>(.*?)</A>");
    Matcher m = p.matcher(content);

    if (m.find()) {
        System.out.println("found");
        System.out.println(m.group(1));
    }
    else {
        System.out.println("not found");
    }

The expected value is: need to get this value

More info:

I can use only regex
The content is a whole HTML string

Any ideas how to find it?

Is it mandatory to use regex? Can't you use a XML parser? It seems more appropriate in your case. — Michel Antoine, Jun 04 '15 at 08:26
Obligatory link to [Don't Parse HTML with Regex by Tony the Pony](http://stackoverflow.com/a/1732454/1509264) - but read the post below it as well. — MT0, Jun 04 '15 at 08:34
While *never* is not exactly the word that should be used, you really should avoid using regular expressions to parse HTML as much as is "programmerly" possible. — signus, Jun 04 '15 at 08:36
@Signus I agree with you mate, I probably pumped it up abit to say never. but I have spent 4 days to parse a html with regex and eventually failing it. while it took about 20min to do it with jsoup (after I stopped the struggle with regex ofcourse ) — nafas, Jun 04 '15 at 08:46
@nafas I too like to wave the "Never do that!" stick in this scenario, and while it is possible to do so, as the link above shows - it simply makes the programmers life *easier*. I think it's best to say to the OP "You can parse it with regex, but don't if you want to be happy! (or not hate yourself, or not hate coding, or give up coding, etc.)" :D — signus, Jun 04 '15 at 08:53
@a_z Any particular reason you can only use regex (i.e. restrictions on environment)? — signus, Jun 04 '15 at 09:11
no, i'm actually looking for the best practice to get values from html tags by css classes — a_z, Jun 04 '15 at 09:17
I also prefer `Jsoup`, but this is what I allowed here, sorry — a_z, Jun 04 '15 at 09:19
@a_z If you were at a development shop, you wouldn't be limited to look for a best practice that is widely agreed upon as very very **bad practice**. No manager or technical lead will enforce such a requirement. This sounds like a homework problem as no other evidence supports it's not. — signus, Jun 04 '15 at 09:20
Well, the highest upvoted answer right now is the most useful for other viewers of your question. If you cannot use Jsoup then by all means try the other answers that provide regex as a solution. — Keale, Jun 04 '15 at 09:21

nafas · Answer 1 · 2015-06-04T09:15:46.527

4

I'm a hater of using regex for HTML parsing, which is why this solution might not be what the asker wants.

Using Jsoup to achieve this:

    String html = "..."; // your HTML code goes here
    Document doc = Jsoup.parse(html);
    Elements elements = doc.select(".whtbigheader"); // that's it: this holds every element that has whtbigheader as a class

To make sure you only get <a> tags:

    // "a.whtbigheader" matches only <a> elements that carry the class
    Elements elements = doc.select("a.whtbigheader");

To get the text, you just need to loop through the elements and read the text of each one:

    for (Element element : elements) {
        System.out.println(element.text());
    }

Download link:

Jsoup 1.8.2 can be downloaded from the Jsoup website :).

edited Jun 04 '15 at 09:15

answered Jun 04 '15 at 08:37

nafas


Although this answer does not use regex, I still think this is the cleanest way to do the job. And [installing Jsoup to your project](http://stackoverflow.com/questions/19632560/installing-a-jar-file) is easy – Keale Jun 04 '15 at 09:13
`Jsoup` is not allowed – a_z Jun 04 '15 at 09:13
@a_z do you mean you are not allowed to use it ? – nafas Jun 04 '15 at 09:16
only regex is allowed – a_z Jun 04 '15 at 09:17
@a_z oh well, you can come back to this question if it doesn't go as you plan, mate – nafas Jun 04 '15 at 09:19

Avinash Raj · Answer 2 · 2015-06-04T08:33:18.137

1

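Use a non-capturing group instead of square brackets to match a word. Square brackets define a character class, so [whtbigheader] matches a single character from that set rather than the whole class name.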

    Pattern p = Pattern.compile("(?i)<A.+?class\\s*?=(['\"])?(?:whtbigheader)\\1[^>]*>(.*?)</A>");
    Matcher m = p.matcher(content);

    if (m.find()) {
        System.out.println("found");
        System.out.println(m.group(2));
    }
    else {
        System.out.println("not found");
    }
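
To illustrate the point about square brackets, here is a minimal sketch of the difference between a character class and a non-capturing group. The class name `CharClassVsGroup` and the helper `matchesWhole` are hypothetical names chosen for this illustration; only `java.util.regex.Pattern` from the JDK is used.

```java
import java.util.regex.Pattern;

public class CharClassVsGroup {
    // Hypothetical helper: true only if the regex matches the ENTIRE input.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // A character class matches exactly one character from the set.
        System.out.println(matchesWhole("[whtbigheader]", "w"));              // true
        System.out.println(matchesWhole("[whtbigheader]", "whtbigheader"));   // false

        // A non-capturing group matches the literal sequence of characters.
        System.out.println(matchesWhole("(?:whtbigheader)", "whtbigheader")); // true
        System.out.println(matchesWhole("(?:whtbigheader)", "w"));            // false
    }
}
```

This is why the pattern in the question cannot match the class name as a word: `[whtbigheader]` consumes one character, not twelve.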


edited Jun 04 '15 at 08:33

answered Jun 04 '15 at 08:27

Avinash Raj


What about: `...` or `...` – MT0 Jun 04 '15 at 08:51
What if I have a lot of these cases? Could you specify what I should add to the regex for this case? – a_z Jun 04 '15 at 09:29
Check out the test strings in my answer - this answer fails about half of them. – MT0 Jun 04 '15 at 09:43

MT0 · Accepted Answer · 2015-06-04T10:13:22.880

A parser is the more robust way to extract information from HTML. However, in this case it is possible to use a regular expression to get what you want, assuming you are never going to have nested anchor tags. If you do have nested anchor tags, you might want to sanity-check your documents, and you will definitely need a parser.

You can use the following regex (with the case-insensitive flag):

    "<a\\s+(?:[^>]+\\s+)?class\\s*=\\s*(?:whtbigheader(?=\\s|>)|(['\"])(?:(?:(?!\\1).)*?\\s+)*whtbigheader(?:\\s+(?:(?!\\1).)*?)*\\1)[^>]*>(.*?)</a>"

The anchor's content is in the second capturing group, which you can extract like this:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class Test {
        static final Pattern ANCHOR_PATTERN = Pattern.compile(
                "<a\\s+(?:[^>]+\\s+)?class\\s*=\\s*(?:whtbigheader(?=\\s|>)|(['\"])(?:(?:(?!\\1).)*?\\s+)*whtbigheader(?:\\s+(?:(?!\\1).)*?)*\\1)[^>]*>(.*?)</a>",
                Pattern.CASE_INSENSITIVE
        );

        // Returns the content of the first matching anchor, or null if none matches.
        public static String getAnchorContents( final String html ){
            final Matcher matcher = ANCHOR_PATTERN.matcher( html );
            if ( matcher.find() ){
                // Group 1 is the optional quote character; group 2 is the anchor content.
                return matcher.group(2);
            }
            return null;
        }

        public static void main( final String[] args ){
            // Inputs that should match and yield "test".
            final String[] tests = {
                    "<a class=whtbigheader>test</a>",
                    "<a class=\"whtbigheader\">test</a>",
                    "<a class='whtbigheader'>test</a>",
                    "<a class =whtbigheader>test</a>",
                    "<a class =\"whtbigheader\">test</a>",
                    "<a class ='whtbigheader'>test</a>",
                    "<a class= whtbigheader>test</a>",
                    "<a class= \"whtbigheader\">test</a>",
                    "<a class= 'whtbigheader'>test</a>",
                    "<a class = whtbigheader>test</a>",
                    "<a class\t=\r\n\"whtbigheader\">test</a>",
                    "<a class =\t'whtbigheader'>test</a>",
                    "<a class=\"otherclass whtbigheader\">test</a>",
                    "<a class=\"whtbigheader otherclass\">test</a>",
                    "<a class=\"whtbigheader2 whtbigheader\">test</a>",
                    "<a class=\"otherclass whtbigheader otherotherclass\">test</a>",
                    "<a class=whtbigheader href=''>test</a>",
            };
            int successes = 0;
            int failures = 0;
            for ( final String test : tests )
            {
                final String contents = getAnchorContents( test );
                if ( "test".equals( contents ) )
                    successes++;
                else
                {
                    System.err.println( test + " => " + contents );
                    failures++;
                }
            }
            // Inputs that should NOT match (getAnchorContents should return null).
            final String[] failingTests = {
                    "<a class=whtbigheader2>test</a>",
                    "<a class=awhtbigheader>test</a>",
                    "<a class=whtbigheader-other>test</a>",
                    "<a class='whtbigheader2'>test</a>",
                    "<a class='awhtbigheader'>test</a>",
                    "<a class='whtbigheader-other'>test</a>",
                    "<a class=otherclass whtbigheader>test</a>",
                    "<a class='otherclass' whtbigheader='value'>test</a>",
                    "<a class='otherclass' id='whtbigheader'>test</a>",
                    "<a><aclass='whtbigheader'>test</aclass></a>",
                    "<a aclass='whtbigheader'>test</a>",
                    "<a class='whtbigheader\"'>test</a>",
                    "<ab class='whtbigheader'><a>test</a></ab>",
            };
            for ( final String test : failingTests )
            {
                final String contents = getAnchorContents( test );
                if ( contents == null )
                    successes++;
                else
                {
                    System.err.println( test + " => " + contents );
                    failures++;
                }
            }
            System.out.println( "Successful tests: " + successes );
            System.out.println( "Failed tests: " + failures );
        }
    }
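
Since the question says the content is a whole HTML string, the same pattern could also be applied repeatedly to collect every matching anchor rather than only the first. The following is a rough sketch under stated assumptions: the class name `AllAnchorContents`, the method `getAllAnchorContents`, and the sample HTML are hypothetical, and `Pattern.DOTALL` is an addition not present in the answer (it lets anchor content span line breaks).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AllAnchorContents {
    // Same expression as in the answer above; DOTALL is an added assumption.
    static final Pattern ANCHOR_PATTERN = Pattern.compile(
            "<a\\s+(?:[^>]+\\s+)?class\\s*=\\s*(?:whtbigheader(?=\\s|>)|(['\"])(?:(?:(?!\\1).)*?\\s+)*whtbigheader(?:\\s+(?:(?!\\1).)*?)*\\1)[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Collects the content (group 2) of every matching anchor, in document order.
    static List<String> getAllAnchorContents(String html) {
        List<String> contents = new ArrayList<>();
        Matcher matcher = ANCHOR_PATTERN.matcher(html);
        while (matcher.find()) {
            contents.add(matcher.group(2));
        }
        return contents;
    }

    public static void main(String[] args) {
        // Hypothetical sample document with two matching anchors and one non-matching.
        String html = "<html><body>\n"
                + "<a href='/one' class='whtbigheader'>first</a>\n"
                + "<a href='/two' class='other'>skip me</a>\n"
                + "<A CLASS=\"big whtbigheader\" HREF='/three'>second</A>\n"
                + "</body></html>";
        System.out.println(getAllAnchorContents(html)); // prints [first, second]
    }
}
```

The caveat about nested anchor tags still applies, since the lazy `(.*?)` stops at the first closing `</a>` it finds.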

Mazdak · Answer 4 · 2015-06-04T08:51:28.473

0

You can use the following regex:

/<a[^>]*class=\s?['"]\s?whtbigheader\s?['"][^>]*>(.*?)</a>/i


Note that if you just want the content of an <a> tag with a certain class, you don't need any extra regex within the tag; a[^>]*class='whtbigheader'[^>]* will do the job:

[^>]* will match anything except >

You also need to use the i modifier (IGNORE CASE) to ignore case!

In addition, regex is not a good or proper way to parse (?:X|H)TML documents. You may want to consider using a proper parser.

Note that if you put your regex in a quoted string, you need to escape the quotes around the class name.
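
As a sketch of how the expression above might look in Java, assuming a string literal is used: the double quote inside the character class has to be escaped, each `\s` becomes `\\s`, and the `i` modifier corresponds to `Pattern.CASE_INSENSITIVE`. The class name `SimpleAnchorRegex` and the method `firstContent` are hypothetical names for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleAnchorRegex {
    // Java form of /<a[^>]*class=\s?['"]\s?whtbigheader\s?['"][^>]*>(.*?)</a>/i
    static final Pattern ANCHOR = Pattern.compile(
            "<a[^>]*class=\\s?['\"]\\s?whtbigheader\\s?['\"][^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE);

    // Returns the content of the first matching anchor, or null if there is none.
    static String firstContent(String html) {
        Matcher m = ANCHOR.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The input string from the question.
        String content = "<A HREF='/articles/0,7340,L-4664450,00.html' CLASS='whtbigheader' "
                + "style='color:#FFFFFF;' HM=1>need to get this value</A>";
        System.out.println(firstContent(content)); // prints: need to get this value
    }
}
```

This simpler pattern only handles a quoted, single-class attribute value; an unquoted value such as `class=whtbigheader` is not matched.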

edited Jun 04 '15 at 08:51

answered Jun 04 '15 at 08:28

Mazdak


You also need to consider if the class is surrounded in double quotes and if there is whitespace between `class`, `=` and the class value. – MT0 Jun 04 '15 at 08:49
Also if there are multiple classes for the anchor. – MT0 Jun 04 '15 at 08:57
