I am trying to extract links from HTML, using the following regular expression:

href=\"([^\"]*)\"

This also extracts links I don't need. How can I write a regular expression that extracts only links with class="l", such as the following?

<a href="http://users.elite.net/runner/jennifers/hello.htm" class="l">
<a href="http://www.hellodesign.com/" class="l">
<a href="http://www.ipl.org/div/hello/" class="l">
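
To make the over-matching concrete, here is a minimal sketch that runs the pattern above with `java.util.regex`. The class name `HrefRegexDemo`, the helper `extractHrefs`, and the second link in the sample (the one without class="l") are hypothetical and only there for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefRegexDemo {
    // The pattern from the question: captures the value of every href="..." attribute.
    static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

    // Returns every href value found, regardless of the tag or its class.
    static List<String> extractHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            hrefs.add(matcher.group(1));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Hypothetical sample: the second link has no class="l" but still matches.
        String html =
              "<a href=\"http://www.hellodesign.com/\" class=\"l\">Hello Design</a>\n"
            + "<a href=\"http://example.com/other\">Other</a>\n";
        for (String href : extractHrefs(html)) {
            System.out.println(href);
        }
    }
}
```

Because the pattern only looks at the `href` attribute, it has no way of knowing which class the surrounding tag carries, so both links are returned.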
BalusC
King Aslan
  • I'll post the obligatory [link](http://stackoverflow.com/a/1732454/960195) to a very famous answer that discourages parsing HTML with regex. – Adam Mihalcin Mar 20 '12 at 03:20
  • The coincidence is that I'm currently wearing a [shirt](http://meta.stackexchange.com/questions/108395/stack-overflow-t-shirt-3rd-anniversary-edition) with an extract of that epic post in the shape of a unicorn :) – BalusC Mar 20 '12 at 03:23

1 Answer

Parsing HTML with regex is unnecessarily complicated; regex is the wrong tool for the job. Just use a normal HTML parser like Jsoup, which lets you select HTML elements with ordinary CSS selectors.

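import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// html is the page source as a String.
Document document = Jsoup.parse(html);
Elements links = document.select("a.l"); // Select all <a class="l"> elements.

for (Element link : links) {
    System.out.println(link.absUrl("href")); // Print each link's absolute URL.
}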
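
For a version that can be compiled and run on its own, the snippet above can be wrapped roughly as follows. This is only a sketch: the class name `ClassLLinks`, the helper `extract`, and the fourth sample link (the one without class="l") are made up for illustration, and it assumes the Jsoup JAR is on the classpath. It reads the raw `href` attribute with `attr("href")`, which is enough here because the sample links are already absolute.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ClassLLinks {
    // Returns the href of every <a> element that carries the CSS class "l".
    static List<String> extract(String html) {
        Document document = Jsoup.parse(html);
        List<String> hrefs = new ArrayList<>();
        for (Element link : document.select("a.l")) {
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // The three links from the question plus one hypothetical link without class="l".
        String html =
              "<a href=\"http://users.elite.net/runner/jennifers/hello.htm\" class=\"l\">1</a>"
            + "<a href=\"http://www.hellodesign.com/\" class=\"l\">2</a>"
            + "<a href=\"http://www.ipl.org/div/hello/\" class=\"l\">3</a>"
            + "<a href=\"http://example.com/other\">4</a>";
        for (String href : extract(html)) {
            System.out.println(href);
        }
    }
}
```

Only the three links with class="l" should be printed; the selector skips the fourth one.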
BalusC
  • How do I import Jsoup into my JSP? – King Aslan Mar 20 '12 at 03:29
  • Just drop the JAR file in the `/WEB-INF/lib` folder the usual way so that it ends up on the classpath. By the way, Java code [belongs](http://stackoverflow.com/questions/3177733/how-to-avoid-java-code-in-jsp-files) in a Java class (like a servlet), not in a JSP file. – BalusC Mar 20 '12 at 03:31
  • I cannot import Jsoup into my JSP; it gives me "cannot find symbol" for Document, Elements... – King Aslan Mar 20 '12 at 03:37
  • That will indeed happen when it isn't in the runtime classpath. As said, drop the JAR file in `/WEB-INF/lib` the usual way. Then you can import it in your JSP file (or easier, Java class) the usual way as for every other class. – BalusC Mar 20 '12 at 03:37
  • Can you please give me the code for importing it? I am trying to import it like this: `<%@page import="org.jsoup.Jsoup"%>`, but it returns an error at `org`. – King Aslan Mar 20 '12 at 03:43
  • That line looks fine. What IDE are you using and what error exactly did you get at org? Can you import it in a normal Java class in the same project? – BalusC Mar 20 '12 at 03:58
  • I am using NetBeans 7.0. The error is "package org.jsoup does not exist". Yes, I can import a normal Java class in this project. – King Aslan Mar 20 '12 at 04:01
  • Can you run the JSP file? Just ignore the error and run it. Probably Netbeans is just being a jerk. If running also fails, then the JAR is not in the `/WEB-INF/lib` at all. – BalusC Mar 20 '12 at 04:44
  • What should the file name of the JAR file be? – King Aslan Mar 20 '12 at 05:15
  • Uh, just the one which you can download from the vendor's homepage. You should already have it, otherwise you wouldn't be able to import it in a normal Java class. You confirmed that this worked. – BalusC Mar 20 '12 at 05:16
  • Actually, WEB-INF did not have a lib folder. I created it and copied in the JAR file from the vendor's homepage... yet it is not working... – King Aslan Mar 20 '12 at 06:41
  • Hey, I successfully imported Jsoup and it is executing fine, but I cannot get any output... – King Aslan Mar 20 '12 at 07:14
  • Hello BalusC, are you there? – King Aslan Mar 21 '12 at 15:10
  • Yes, thank you very much, but can you help me with this question: http://stackoverflow.com/questions/9807186/jsoup-links-extraction – King Aslan Mar 21 '12 at 15:25
  • You already have the answer. You forgot to specify the protocol, so the URL is not a URL at all. – BalusC Mar 21 '12 at 15:26
  • Your code works for extracting links from all websites, but not with Google and AOL. I need to extract links from, say, http://www.google.com/search?q=mysearchkeyword – King Aslan Mar 21 '12 at 15:27
  • Even after I specify the protocol, only Google and AOL are not working; the same code works with Yahoo, Bing and Ask... – King Aslan Mar 21 '12 at 15:29
  • My project is to implement a metasearch engine. I am able to extract links from Yahoo, Bing and Ask, but the same does not work with Google and AOL. What might be the reason? – King Aslan Mar 21 '12 at 15:32
  • Hello BalusC, I need to use two Elements variables in a for loop. How do I do it? Question: http://stackoverflow.com/questions/9816605/jsoup-multi-element-output – King Aslan Mar 22 '12 at 04:50
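
The earlier remark about the missing protocol can be checked with plain `java.net.URI` before the string is handed to any HTTP client. This is a minimal sketch: the class `ProtocolCheck` and the helper `hasProtocol` are hypothetical names, and the first test string is only an assumption about what a protocol-less URL might look like, since the linked question is not reproduced here.

```java
import java.net.URI;

public class ProtocolCheck {
    // True if the string parses as a URI with a scheme such as "http".
    // URI.create throws IllegalArgumentException on syntactically invalid input.
    static boolean hasProtocol(String url) {
        return URI.create(url).getScheme() != null;
    }

    public static void main(String[] args) {
        // Assumed example of a URL written without a protocol.
        System.out.println(hasProtocol("www.google.com/search?q=mysearchkeyword"));
        // The same URL with the protocol specified.
        System.out.println(hasProtocol("http://www.google.com/search?q=mysearchkeyword"));
    }
}
```

A check like this only catches the missing-protocol case; it says nothing about why a particular site might respond differently once the URL is well-formed.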