How do you parse links from html using Java?

Question

I'm very much a Java novice. For my class we have to print out all of the links that are to be parsed from a user-inputted html source code.

Basically, I want to figure out how to take the string of the link that comes after the href attribute and do that for all links on the webpage, without using external methods (i.e. using arrays, substrings, and methods of strings but not importing other libraries).

Correct way: Proper HTML parser. For your class: I assume just simply regex. — LanguagesNamedAfterCofee, Oct 13 '12 at 18:27
Have you seen [this](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)? Not that I want to correct you, it's just such a great post :) — linski, Oct 13 '12 at 20:50
It can be done with the help of jsoup. More information can be found in the example at http://www.mkyong.com/java/jsoup-html-parser-hello-world-examples/ – jfk, Mar 04 '14 at 17:16

score 5 · Answer 1 · edited May 23 '17 at 11:43


Don't do it with Parser or RegExp. Try Jerry. Something like this (not tested):

Jerry doc = jerry(html);
doc.$("a").each(new JerryFunction() {
    public boolean onNode(Jerry $this, int index) {
        String href = $this.attr("href");
        System.out.println(href);
        return true; // the method must return a boolean; true keeps the loop going
    }
});

or any HTML-friendly query language. Given the no-external-libraries requirement, see the question "Trying to parse links in an HTML directory listing using Java".


answered Oct 13 '12 at 18:38

Anton Bessonov


Thank you, but is there a way to do this that just uses substrings, arrays, and/or methods of String? Probably should have clarified in my original post. – user1743740 Oct 13 '12 at 18:48
@AntonBessonov, question was for java not js :) – Chirlo Oct 13 '12 at 18:59

Yes, you can. But it's very error-prone, horrible to maintain, and you'll write more than the 7 lines above. Why would you do it with substrings or the like? See http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not and http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 and http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html – Anton Bessonov Oct 13 '12 at 18:59
Ooops, my bad, +1 for the info :D – Chirlo Oct 13 '12 at 19:04
For the project, we're supposed to do it with substrings, etc. I'm sure it's very error-prone, but for the sake of the project I think my instructor will test with only basic html. – user1743740 Oct 13 '12 at 19:15

linski · Answer 2 · 2012-10-13T20:44:14.320

I don't know what class you are in, so the regular expression solution might be too advanced for you.
That might be the case if you are a first-year student, for example, but I can't really tell.

You could do it using substrings or arrays, but that is way too much coding. That's why standard Java regular expressions exist:

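import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Literal "<a>", a captured run of characters other than '<' and '>', then literal "</a>"
String A_TAG_MATCHING_GROUP = "<a>([^<>]*)</a>";

Matcher matcher = Pattern.compile(A_TAG_MATCHING_GROUP).matcher("<html>\n<head>d\nadas</head><body><a>LINK_DESC_ONE</a>dsdasd<a>LINK_DESC_2</a></body></html>");
String url, linkDescription; // not used yet; handy once you extend the pattern
while (matcher.find()) {
    System.out.println(matcher.group(1)); // group 1 is the parenthesized part of the pattern
}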

Compile and run this code, then continue reading!

The crucial part is the A_TAG_MATCHING_GROUP regular expression. As it is now, it will match the exact string "<a>", then the following, then the exact string "</a>":

none or as many characters as you want (as denoted by the star, *)
a character, as used above, is any character that is not (as denoted by the caret, ^) "<" or ">" (the exact term for something inside square brackets, [ ], is a character class)
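
To see what the star and the character class actually do, here is a small sketch that runs the same pattern against a few inputs. The class name CharacterClassDemo and the helper firstCapture are made up for this illustration; only Pattern and Matcher come from the standard library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class CharacterClassDemo {
    // Same pattern as above: literal "<a>", a captured run of characters
    // that are neither '<' nor '>', then literal "</a>".
    static final Pattern A_TAG = Pattern.compile("<a>([^<>]*)</a>");

    // Returns the captured text of the first match, or null if nothing matches.
    static String firstCapture(String html) {
        Matcher m = A_TAG.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The star allows zero characters, so an empty link body still matches.
        System.out.println(firstCapture("<a>plain</a>"));
        System.out.println(firstCapture("<a></a>"));
        // A nested tag contains '<', which the character class excludes.
        System.out.println(firstCapture("<a><b>bold</b></a>"));
        // "<a>" is matched literally, so a tag with attributes does not match.
        System.out.println(firstCapture("<a href=\"x.html\">text</a>"));
    }
}
```

The last case is the interesting one for this question: as written, the pattern never sees an a tag that carries an href attribute, so that is the part of the expression you would have to change.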

So, if you write the A_TAG_MATCHING_GROUP regular expression well, with

matcher.group(i);

you'll get the URL. Since it is for your class I won't write it for you :) Modify the matcher argument and play a little (change the hardcoded html string). Get some real HTML pages and compare your output with a real tool's output, like this one.

Of course, you must read the given tutorial first (this might also be useful); the relevant API classes are java.util.regex.Pattern and java.util.regex.Matcher.

But, if you want to use "arrays and substrings", you could use the following algorithm:

read the html character by character, e.g.

for (char c : html.toCharArray()) { // html is the String holding the page source
    // inspect c here
}
when you get to the "<", remember it (e.g. in a boolean variable first_char_of_a_tag_found)
decide whether the "<" must be immediately followed by the "a" char or whether you will allow line breaks and spaces. When you detect the "a", remember it in a boolean variable.
when you reach ' href="', start remembering the contents. You might use [substring()](http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#substring(int, int)) on the html string there, and store its return value in a StringBuilder variable called url.

This is a very low-level algorithm, but it will do the job. It requires a lot of coding and it is a monolithic, procedural approach.
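
For basic HTML, a shorter variation on the same idea is to let indexOf find the tag boundaries instead of tracking boolean flags by hand. The sketch below is only that, a sketch: the class HrefScanner and its methods are invented for this example, it uses java.util.List purely for convenience (an array would work too), and it assumes simple markup with no ">" inside attribute values, no spaces around "=", and no comments or scripts containing tags.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original answer.
class HrefScanner {
    // Collects the href value of every <a ...> tag, in document order.
    static List<String> extractHrefs(String html) {
        List<String> links = new ArrayList<>();
        int pos = 0;
        while (true) {
            int open = html.indexOf('<', pos);
            if (open < 0) break;
            int close = html.indexOf('>', open);
            if (close < 0) break;
            String tag = html.substring(open + 1, close); // text between '<' and '>'
            pos = close + 1;
            if (isAnchorTag(tag)) {
                String href = hrefValue(tag);
                if (href != null) links.add(href);
            }
        }
        return links;
    }

    // True for tags such as "a href=..." or "A HREF=...", but not "abbr" or "/a".
    static boolean isAnchorTag(String tag) {
        return tag.length() >= 2
                && (tag.charAt(0) == 'a' || tag.charAt(0) == 'A')
                && Character.isWhitespace(tag.charAt(1));
    }

    // Returns the value after "href=" inside one tag, or null if there is none.
    static String hrefValue(String tag) {
        for (int i = 1; i + 5 <= tag.length(); i++) {
            // Require whitespace before "href=" so "data-href=" is not picked up.
            if (!Character.isWhitespace(tag.charAt(i - 1))) continue;
            if (!tag.regionMatches(true, i, "href=", 0, 5)) continue;
            int start = i + 5;
            if (start >= tag.length()) return "";
            char quote = tag.charAt(start);
            if (quote == '"' || quote == '\'') {
                int end = tag.indexOf(quote, start + 1);
                return end < 0 ? null : tag.substring(start + 1, end);
            }
            int end = start; // unquoted value: runs until the next whitespace
            while (end < tag.length() && !Character.isWhitespace(tag.charAt(end))) end++;
            return tag.substring(start, end);
        }
        return null;
    }

    public static void main(String[] args) {
        String html = "<body><a href=\"http://example.com/a\">A</a>"
                + "<A HREF='b.html'>B</A><a name=\"n\">no link</a></body>";
        for (String link : extractHrefs(html)) {
            System.out.println(link);
        }
    }
}
```

It only relies on indexOf, substring, charAt and regionMatches, so it stays within the "methods of String" constraint apart from the list used to collect the results.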

Basically, loosely speaking, you will be implementing a regular expression "engine", the one I described in the first part of the post.

I programmed both as assignments (the first one for a job interview in Java, and the second one in C as an entrance exam for a Java collegium). Despite the usual learning order (the second one first), I'd recommend starting with the first one, but it depends on how tight your schedule is and on what you already know.

Hope it helps :)

EDIT:

You can't parse HTML with regular expressions, but you can parse out URLs from a tags with them. To be clear, though, I'd definitely go with Jerry as Anton suggested.

You can see that Jerry-like solutions are way better in real life merely by comparing the size of his post and mine, and the time needed to process each, for starters :))

score 0 · Answer 3 · edited May 23 '17 at 11:48


You might want to consider some of these ideas


answered Oct 13 '12 at 20:38

btiernay

