Getting links from HTML source code

Question

I have the HTML source code of a page as a string. I want to extract only the links from that string and put them into an ArrayList. That is, I want the strings that appear inside <a href="THE LINK I WANT">. But I want to do this without using any external libraries. How can I do it with a simple algorithm using the String class and loops? Thank you!

Why would you not want to use a HTML parsing library for this? Doing this properly without a library will be reinventing a hugely complicated wheel. — Pekka, Mar 06 '12 at 10:46
Because it is an assignment and my instructor wants me to do this with a simple algorithm. Is it simple? — El3ctr0n1c4, Mar 06 '12 at 10:49
It is not that complicated, you can search through the html for `` in which case there is no `href` and you have to again start looking for the ` — prajeesh kumar, Mar 06 '12 at 10:59
@aphex: No, it isn't simple. HTML parsing isn't trivial. Any "simple" solution will break with non-trivial input such as ``. — RoToRa, Mar 06 '12 at 11:06
@RoToRa actually it was simple. I found the answer. Even so, thanks for your effort — El3ctr0n1c4, Mar 06 '12 at 16:12
No, it's not simple as the commenters above say. It might be simple to parse a small subset of HTML for an assignment, but it certainly isn't to do anything even slightly more complicated (see the example @RoToRa put up). — Siddhu, Jan 12 '15 at 20:59

score 5 · Answer 1 · edited May 23 '17 at 12:29

The Java Regex API is not the right tool for this goal. Use the efficient, secure and well-tested high-level tools mentioned in the other answers.

If your question is about the Regex API rather than a real-life problem (for learning purposes, for example), you can do it with the following code:

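import java.util.regex.Matcher;
import java.util.regex.Pattern;

String html = "foo <a href='link1'>bar</a> baz <a href='link2'>qux</a> foo";
// The reluctant (.*?) stops at the first '> that follows the opening quote
Pattern p = Pattern.compile("<a href='(.*?)'>");
Matcher m = p.matcher(html);
while (m.find()) {
    System.out.println(m.group(0)); // the whole matched tag
    System.out.println(m.group(1)); // the captured link only
}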

And the output is:

<a href='link1'>
link1
<a href='link2'>
link2

Please note that the lazy (reluctant) quantifier *? must be used so that each match stops at the end of a single tag. Group 0 is the entire match; group 1 is the text captured by the first pair of parentheses.
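To make the difference concrete, here is a minimal sketch that runs the same input through the reluctant pattern and through a greedy variant. The class name QuantifierDemo and the helper captures are made up for this illustration; only Pattern and Matcher come from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class QuantifierDemo {
    // Collects group 1 of every match of `regex` in `input`.
    static List<String> captures(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "foo <a href='link1'>bar</a> baz <a href='link2'>qux</a> foo";
        // Reluctant: two matches, one per tag.
        System.out.println(captures("<a href='(.*?)'>", html));
        // Greedy: a single match that runs on to the last '> in the string.
        System.out.println(captures("<a href='(.*)'>", html));
    }
}
```

With the greedy pattern the single capture swallows everything between the first opening quote and the last closing quote, which is why the reluctant form is needed here.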

A note to consider:

Using regular expressions to pull values from HTML is always a mistake. HTML syntax is a lot more complex than it may first appear, and it is very easy for a page to trip up even a very complex regular expression.

Use an HTML Parser instead. See also What are the pros and cons of the leading Java HTML parsers?
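If external libraries are ruled out, one option worth knowing about is the old HTML parser that ships with the JDK in javax.swing.text.html. The following is only a sketch under that assumption: the class name JdkParserLinks and the method extractLinks are invented for this example, and the bundled parser targets an old HTML dialect, so it may not cope well with modern pages.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of the original answer.
class JdkParserLinks {
    // Returns the href value of every <a> start tag the parser reports.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        try {
            // The last argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"link1\">bar</a> "
                + "<a class='x' href='link2'>qux</a></body></html>";
        System.out.println(extractLinks(html));
    }
}
```

Because the parser hands over attributes already split out, the quoting style and the order of attributes inside the tag no longer matter to the calling code.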

answered Mar 06 '12 at 10:53 · Ramandeep Singh


As I stated my question, I don't want to use any external libraries. I found the answer. Even so, thanks for your answer – El3ctr0n1c4 Mar 06 '12 at 16:14
Your method, as you stated in your answer, is just a workaround, not a proper method. You can at least use regex to solve your problem (and it's not an external library). – Ramandeep Singh Mar 06 '12 at 16:17
Actually, it doesn't need to be proper, because I just wanted a simple algorithm. I've solved it though :D – El3ctr0n1c4 Mar 06 '12 at 19:16
It's your call, but if you showed my answer to your instructor, he would definitely be surprised and happy ;) – Ramandeep Singh Mar 07 '12 at 04:59

score 1 · Accepted Answer · answered Mar 06 '12 at 14:46

I've found the answer!!!!!

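public ArrayList<String> getLinks() {
    // `url` (the HTML source string) and `links` (an ArrayList<String>)
    // are assumed to be fields of the enclosing class.
    for (int i = 0; i < url.length() - 6; i++) {
        // Match the whole attribute name, not just "hr", so <hr> is skipped
        if (url.startsWith("href=", i)) {
            for (int k = i; k < url.length(); k++) {
                if (url.charAt(k) == '>') {
                    // i + 6 skips href=" and k - 1 drops the closing quote
                    if (k - 1 >= i + 6) {
                        links.add(url.substring(i + 6, k - 1));
                    }
                    break;
                }
            }
        }
    }
    return links;
}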
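The same idea can be written with String.indexOf instead of character-by-character loops. This is a minimal sketch, not the asker's code: the class name LinkExtractor and the method extractLinks are made up for illustration. It assumes a lowercase href= attribute with a quoted value, and it does not check that the attribute belongs to an <a> tag, so it would also pick up the href of a <link> element.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original answer.
class LinkExtractor {
    // Collects every quoted value that directly follows "href=".
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        String marker = "href=";
        int pos = html.indexOf(marker);
        while (pos != -1) {
            int start = pos + marker.length();
            if (start >= html.length()) {
                break; // "href=" was the very end of the input
            }
            int next = start;
            char quote = html.charAt(start);
            if (quote == '"' || quote == '\'') {
                // The value ends at the next occurrence of the same quote.
                int end = html.indexOf(quote, start + 1);
                if (end == -1) {
                    break; // unterminated value, stop scanning
                }
                links.add(html.substring(start + 1, end));
                next = end + 1;
            }
            // Unquoted values are skipped; continue after this attribute.
            pos = html.indexOf(marker, next);
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "foo <a href=\"link1\">bar</a> <hr> "
                + "<a class='x' href='link2' target='_blank'>qux</a>";
        System.out.println(extractLinks(html));
    }
}
```

Searching for the closing quote rather than for '>' means that extra attributes after href do not end up inside the extracted link.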
