Parsing an HTML string in Java using regex

Question

I need help parsing an HTML string.

String str = "<div id=\"test\" ><a href=\"#aaaa\"> Amrit </a> </div><div><a href=\"#bbbb\" > Amrit </a> </div><a href=\"#cccc\" ><a href=\"#dddd\" >";
String reg = ".*(<\\s*a\\s+href\\s*=\\s*\\\"(.+?)\"\\s*>).*";

str is my sample string and reg is the regex I use to find all the anchor tags, especially the value of href. With this regex, only the last part of the string is shown.

    Pattern MY_PATTERN = Pattern.compile(reg);
    Matcher m = MY_PATTERN.matcher(str);
    while (m.find()) {
        for(int i=0; i<m.groupCount(); i++){
            String s = m.group(i);
            System.out.println("->" + s);
        }
    }

This is the code I wrote. What is missing?

Also, what if I want to replace a particular character in a string? For example, my URLs change from [string]_[string] to [string]-[string]. How can I find the "_" and replace it with "-"?

Do not parse HTML with a regex: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 Instead, use an XML parser. — , Nov 04 '11 at 17:22
[The pony he comes](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). — Dave Newton, Nov 04 '11 at 17:22
@JackManey XML parser does not work for all HTML. It only works for XHTML. — gigadot, Nov 04 '11 at 18:00

score 2 · Answer 1 · edited May 23 '17 at 12:21


Instead of parsing HTML with a regex (regexes are meant for regular languages, and HTML is not a regular language), use HtmlUnit:

http://htmlunit.sourceforge.net/

This may also help: the Stack Overflow question "Options for HTML scraping?"

answered Nov 04 '11 at 17:23

Kamil Lach


score 0 · Answer 2 · answered May 17 '12 at 17:22

I would suggest using jsoup. It is much more flexible than a regex. Sample code is below.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ListLinks {

    public static void main(String[] args) throws Exception {
        String url = "http://www.umovietv.com/EntertainmentList.aspx";
        // Fetch the page and parse it into a DOM.
        Document doc = Jsoup.connect(url).get();
        // Select every <a> element that has an href attribute.
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            // "abs:href" resolves the href against the document's base URI.
            print("%s", link.attr("abs:href"));
        }
    }

    // Formats the message and prints it on its own line.
    private static void print(String msg, Object... args) {
        System.out.println(String.format(msg, args));
    }
}

Refer to http://jsoup.org/ for more information.

score 0 · Answer 3 · answered Nov 04 '11 at 19:55

It looks like you have one double escape too many.
This segment may fix it: "<\\s*a\\s+href\\s*=\\s*\"(.+?)\"\\s*>", but I can't say
whether the regex as a whole works.
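
To illustrate how that segment might be used, here is a minimal sketch with only java.util.regex. It drops the leading and trailing ".*" from the question's regex: with them, the first find() consumes the whole string, so only the last anchor ends up in the capture groups. The question's loop also stops at i < m.groupCount(), which skips the last group, the one holding the href. The class name HrefExtractor and both method names are made up for this example, and the replace call assumes every "_" in the URL should become "-". The parsers suggested in the other answers remain the safer choice for real-world HTML.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named only for this sketch.
class HrefExtractor {

    // The segment from this answer, without a leading or trailing ".*",
    // so each find() call matches a single opening <a href="..."> tag.
    private static final Pattern ANCHOR =
            Pattern.compile("<\\s*a\\s+href\\s*=\\s*\"(.+?)\"\\s*>");

    // Collects the href value (capture group 1) of every match.
    static List<String> extractHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            hrefs.add(m.group(1));
        }
        return hrefs;
    }

    // Literal character replacement; no regex is needed for this part.
    static String underscoresToHyphens(String url) {
        return url.replace('_', '-');
    }

    public static void main(String[] args) {
        // The sample string from the question.
        String str = "<div id=\"test\" ><a href=\"#aaaa\"> Amrit </a> </div>"
                + "<div><a href=\"#bbbb\" > Amrit </a> </div>"
                + "<a href=\"#cccc\" ><a href=\"#dddd\" >";
        for (String href : extractHrefs(str)) {
            System.out.println("->" + href);
        }
        // "some_page_name" is an invented example value.
        System.out.println(underscoresToHyphens("some_page_name"));
    }
}
```

On the sample string this should yield the four href values in order. Note that the pattern only matches tags where href is the sole attribute; a tag such as <a href="x" class="y"> would make the lazy group run on past the first closing quote.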
