1

I need to get the content of an <a> HTML tag by a certain CSS class name. The CSS class that I need to find is: whtbigheader

What I have done so far is this:

    content = "<A HREF='/articles/0,7340,L-4664450,00.html' CLASS='whtbigheader' style='color:#FFFFFF;' HM=1>need to get this value</A>";

    Pattern p = Pattern.compile("<A.+?class\\s*?=[whtbigheader]['\"]?([^ '\"]+).*?>(.*?)</A>");
    Matcher m = p.matcher(content);

    if (m.find()) {
        System.out.println("found");
        System.out.println(m.group(1));
    }
    else {
        System.out.println("not found");
    }

The expected value is: need to get this value

More info:

  • Can use only regex
  • The content is a whole HTML string

Any ideas how to find it?

a_z
  • 342
  • 1
  • 3
  • 14
  • 1
    Is it mandatory to use regex? Can't you use an XML parser? It seems more appropriate in your case. – Michel Antoine Jun 04 '15 at 08:26
  • 2
    Just take it as advice: **never** use regex for HTML – nafas Jun 04 '15 at 08:28
  • 5
    Obligatory link to [Don't Parse HTML with Regex by Tony the Pony](http://stackoverflow.com/a/1732454/1509264) - but read the post below it as well. – MT0 Jun 04 '15 at 08:34
  • While *never* is not exactly the word that should be used, you really should avoid using regular expressions to parse HTML as much as is "programmerly" possible. – signus Jun 04 '15 at 08:36
  • 2
    @Signus I agree with you mate, I probably overstated it by saying never. But I spent 4 days trying to parse HTML with regex and eventually failed, while it took about 20 minutes to do it with jsoup (after I stopped struggling with regex, of course) – nafas Jun 04 '15 at 08:46
  • 1
    @nafas I too like to wave the "Never do that!" stick in this scenario, and while it is possible to do so, as the link above shows, using a parser simply makes the programmer's life *easier*. I think it's best to say to the OP "You can parse it with regex, but don't if you want to be happy! (or not hate yourself, or not hate coding, or give up coding, etc.)" :D – signus Jun 04 '15 at 08:53
  • I can use only regex – a_z Jun 04 '15 at 09:08
  • 1
    @a_z Any particular reason you can only use regex (i.e. restrictions on environment)? – signus Jun 04 '15 at 09:11
  • No, that's all I got, we're not using Jsoup and others – a_z Jun 04 '15 at 09:12
  • @a_z Are you unable to use JSoup, or not *allowed* to? – signus Jun 04 '15 at 09:13
  • 1
    Could this be a homework question? – Keale Jun 04 '15 at 09:16
  • no, i'm actually looking for the best practice to get values from html tags by css classes – a_z Jun 04 '15 at 09:17
  • IMHO the best practice is not to use regex in parsing html. – Keale Jun 04 '15 at 09:18
  • I also prefer `Jsoup`, but this is what I'm allowed here, sorry – a_z Jun 04 '15 at 09:19
  • @a_z If you were at a development shop, you wouldn't be limited to looking for a best practice within something that is widely agreed to be very, very **bad practice**. No manager or technical lead would enforce such a requirement. This sounds like a homework problem, as there is no evidence to suggest otherwise. – signus Jun 04 '15 at 09:20
  • 1
    Well, the highest upvoted answer right now is the most useful for other viewers of your question. If you cannot use Jsoup then by all means try the other answers that provide regex as a solution. – Keale Jun 04 '15 at 09:21

4 Answers

4

I dislike using regex for HTML parsing, so this solution might not be what the asker wants:

Using Jsoup to achieve this:

String html; // your html code
Document doc = Jsoup.parse(html);
Elements elements = doc.select(".whtbigheader"); // that's it: this contains all the tags with whtbigheader as their class

To make sure you only get <a> tags:

Elements elements=doc.select("a").select(".whtbigheader");

To get the text, you just need to loop through the elements and read each one's text:

for(Element element : elements){
   System.out.println(element.text());
}

download link:

to download Jsoup 1.8.2 click here :).

nafas
  • 5,283
  • 3
  • 29
  • 57
1

Use a non-capturing group instead of square brackets to match a word. Square brackets define a character class, which matches only a single character from the set.

Pattern p = Pattern.compile("(?i)<A.+?class\\s*?=(['\"])?(?:whtbigheader)\\1[^>]*>(.*?)</A>");
Matcher m = p.matcher(content);

if (m.find()) {
    System.out.println("found");
    System.out.println(m.group(2));
}
else {
    System.out.println("not found");
}

DEMO

IDEONE
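
To illustrate why the square brackets in the question's pattern are the problem, here is a minimal sketch using only `java.util.regex`. The class name `CharClassDemo` and the helper `matchesWhole` are hypothetical names chosen for this example, not part of the answer above.

```java
import java.util.regex.Pattern;

public class CharClassDemo {
    // [whtbigheader] is a character class: it matches exactly ONE character
    // taken from the set {w, h, t, b, i, g, e, a, d, r}.
    static final Pattern CHAR_CLASS = Pattern.compile("[whtbigheader]");

    // (?:whtbigheader) is a non-capturing group around the literal word.
    static final Pattern LITERAL = Pattern.compile("(?:whtbigheader)");

    // Hypothetical helper: true only if the pattern matches the entire input.
    static boolean matchesWhole(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole(CHAR_CLASS, "w"));            // true
        System.out.println(matchesWhole(CHAR_CLASS, "whtbigheader")); // false
        System.out.println(matchesWhole(LITERAL, "whtbigheader"));    // true
        System.out.println(matchesWhole(LITERAL, "w"));               // false
    }
}
```

Because the character class consumes a single character, the original pattern can never step over the whole class name, which is why it reports "not found".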

Avinash Raj
  • 172,303
  • 28
  • 230
  • 274
1

A parser is the more robust way to extract information from HTML. However, in this case it is possible to use a regular expression to get what you want, assuming you are never going to have nested anchor tags (if you do have nested anchor tags, you might want to sanity-check your documents, and you will definitely need a parser).

You can use the following regex (with the case-insensitive flag):

"<a\\s+(?:[^>]+\\s+)?class\\s*=\\s*(?:whtbigheader(?=\\s|>)|(['\"])(?:(?:(?!\\1).)*?\\s+)*whtbigheader(?:\\s+(?:(?!\\1).)*?)*\\1)[^>]*>(.*?)</a>"

The anchor's contents are captured in the second group, which you can extract like this:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {
static final Pattern ANCHOR_PATTERN = Pattern.compile(
        "<a\\s+(?:[^>]+\\s+)?class\\s*=\\s*(?:whtbigheader(?=\\s|>)|(['\"])(?:(?:(?!\\1).)*?\\s+)*whtbigheader(?:\\s+(?:(?!\\1).)*?)*\\1)[^>]*>(.*?)</a>",
        Pattern.CASE_INSENSITIVE
);
public static String getAnchorContents( final String html ){
    final Matcher matcher = ANCHOR_PATTERN.matcher( html );
    if ( matcher.find() ){
        return matcher.group(2);
    }
    return null;
}

public static void main( final String[] args ){
    final String[] tests = {
            "<a class=whtbigheader>test</a>",
            "<a class=\"whtbigheader\">test</a>",
            "<a class='whtbigheader'>test</a>",
            "<a class =whtbigheader>test</a>",
            "<a class =\"whtbigheader\">test</a>",
            "<a class ='whtbigheader'>test</a>",
            "<a class= whtbigheader>test</a>",
            "<a class= \"whtbigheader\">test</a>",
            "<a class= 'whtbigheader'>test</a>",
            "<a class = whtbigheader>test</a>",
            "<a class\t=\r\n\"whtbigheader\">test</a>",
            "<a class =\t'whtbigheader'>test</a>",
            "<a class=\"otherclass whtbigheader\">test</a>",
            "<a class=\"whtbigheader otherclass\">test</a>",
            "<a class=\"whtbigheader2 whtbigheader\">test</a>",
            "<a class=\"otherclass whtbigheader otherotherclass\">test</a>",
            "<a class=whtbigheader href=''>test</a>",
    };
    int successes = 0;
    int failures = 0;
    for ( final String test : tests )
    {
        final String contents = getAnchorContents( test );
        if ( "test".equals( contents ) )
            successes++;
        else
        {
            System.err.println( test + " => " + contents );
            failures++;
        }
    }
    final String[] failingTests = {
            "<a class=whtbigheader2>test</a>",
            "<a class=awhtbigheader>test</a>",
            "<a class=whtbigheader-other>test</a>",
            "<a class='whtbigheader2'>test</a>",
            "<a class='awhtbigheader'>test</a>",
            "<a class='whtbigheader-other'>test</a>",
            "<a class=otherclass whtbigheader>test</a>",
            "<a class='otherclass' whtbigheader='value'>test</a>",
            "<a class='otherclass' id='whtbigheader'>test</a>",
            "<a><aclass='whtbigheader'>test</aclass></a>",
            "<a aclass='whtbigheader'>test</a>",
            "<a class='whtbigheader\"'>test</a>",
            "<ab class='whtbigheader'><a>test</a></ab>",
    };
    for ( final String test : failingTests )
    {
        final String contents = getAnchorContents( test );
        if ( contents == null )
            successes++;
        else
        {
            System.err.println( test + " => " + contents );
            failures++;
        }
    }
    System.out.println( "Successful tests: " + successes );
    System.out.println( "Failed tests: " + failures );
}
}
MT0
  • 143,790
  • 11
  • 59
  • 117
0

You can use the following regex:

/<a[^>]*class=\s?['"]\s?whtbigheader\s?['"][^>]*>(.*?)</a>/i

Demo

Note that if you just want the content of an <a> tag with a certain class, you don't need any extra regex inside the tag; a[^>]*class='whtbigheader'[^>]* will do the job:

[^>]* will match anything except >

Also, you need to use the i modifier (IGNORE CASE) to ignore case.


In addition, regex is not a good way to parse (?:X|H)TML documents. You may want to consider using a proper parser.

Note that if you put the regex inside a quoted string, you need to escape the quotes around the class name.
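
As a rough sketch of what that escaping looks like in Java, the regex above can be written as a string literal as follows. The class name `SimpleAnchorRegex` and the method `extract` are hypothetical names for this example; the input string is the one from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleAnchorRegex {
    // In a Java string literal the double quote inside the character class
    // must be escaped, and every regex backslash is doubled. The trailing
    // /i modifier becomes the Pattern.CASE_INSENSITIVE flag.
    static final Pattern ANCHOR = Pattern.compile(
            "<a[^>]*class=\\s?['\"]\\s?whtbigheader\\s?['\"][^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE);

    // Returns the text of the first matching anchor, or null if none matches.
    static String extract(String html) {
        Matcher m = ANCHOR.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String content = "<A HREF='/articles/0,7340,L-4664450,00.html' CLASS='whtbigheader' "
                + "style='color:#FFFFFF;' HM=1>need to get this value</A>";
        System.out.println(extract(content)); // need to get this value
    }
}
```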

Mazdak
  • 105,000
  • 18
  • 159
  • 188
  • You also need to consider if the class is surrounded in double quotes and if there is whitespace between `class`, `=` and the class value. – MT0 Jun 04 '15 at 08:49
  • Also if there are multiple classes for the anchor. – MT0 Jun 04 '15 at 08:57
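
The two limitations raised in these comments can be checked with a short sketch. It assumes the simple regex from the answer above, translated to a Java string; the class name `SimpleRegexLimits` and the method `extract` are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleRegexLimits {
    // The simple regex from the answer, written as a Java string literal.
    static final Pattern ANCHOR = Pattern.compile(
            "<a[^>]*class=\\s?['\"]\\s?whtbigheader\\s?['\"][^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE);

    // Returns the text of the first matching anchor, or null if none matches.
    static String extract(String html) {
        Matcher m = ANCHOR.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Plain case: matches.
        System.out.println(extract("<a class=\"whtbigheader\">test</a>"));            // test
        // Whitespace before '=': the literal "class=" no longer matches.
        System.out.println(extract("<a class = \"whtbigheader\">test</a>"));          // null
        // Multiple classes: the value does not start with whtbigheader.
        System.out.println(extract("<a class=\"otherclass whtbigheader\">test</a>")); // null
    }
}
```

Handling these cases is what makes the longer pattern in MT0's answer so much more involved.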