
I am trying to crawl a URL in order to extract the other URLs contained in its page. To do so, I read the page's HTML line by line, match each line against a pattern, and extract the needed parts, as shown below:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class SimpleCrawler {
        static String pattern = "https://www\\.([^&]+)\\.(?:com|net|org|)/([^&]+)";
        static Pattern UrlPattern = Pattern.compile(pattern);
        static Matcher UrlMatcher;

        public static void main(String[] args) {
            try {
                URL url = new URL("https://stackoverflow.com/");
                BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()));
                // The variable must be declared outside the loop condition
                String line;
                while ((line = br.readLine()) != null) {
                    UrlMatcher = UrlPattern.matcher(line);
                    if (UrlMatcher.find()) {
                        String extractedPath = UrlMatcher.group(1);
                        String extractedPath2 = UrlMatcher.group(2);
                        System.out.println("http://www." + extractedPath + ".com" + extractedPath2);
                    }
                }
            } catch (Exception ex) {
                ex.printStackTrace();
            }
        }
    }

However, there are some issues with it that I would like to address:

  1. How can I make http, www, or both optional? I have encountered many links that lack one or both parts, so the regex does not match them.
  2. My pattern has two groups: one from http up to the domain extension, and a second for whatever comes after it. This causes two sub-problems:
     2.1 Because the input is HTML, the tags that follow the URL are extracted too.
     2.2 In System.out.println("http://www."+extractedPath+".com"+extractedPath2); I cannot be sure the printed URL is right (regardless of the previous issues), because I do not know which domain extension was matched.
  3. Last but not least, how can I match both http and https?
user3049183

2 Answers


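How about matching against a general URI pattern, written in free-spacing mode so that each part is commented: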

try {
    boolean foundMatch = subjectString.matches(
        "(?imx)^\n" +
        "(# Scheme\n" +
        " [a-z][a-z0-9+\\-.]*:\n" +
        " (# Authority & path\n" +
        "  //\n" +
        "  ([a-z0-9\\-._~%!$&'()*+,;=]+@)?              # User\n" +
        "  ([a-z0-9\\-._~%]+                            # Named host\n" +
        "  |\\[[a-f0-9:.]+\\]                            # IPv6 host\n" +
        "  |\\[v[a-f0-9][a-z0-9\\-._~%!$&'()*+,;=:]+\\])  # IPvFuture host\n" +
        "  (:[0-9]+)?                                  # Port\n" +
        "  (/[a-z0-9\\-._~%!$&'()*+,;=:@]+)*/?          # Path\n" +
        " |# Path without authority\n" +
        "  (/?[a-z0-9\\-._~%!$&'()*+,;=:@]+(/[a-z0-9\\-._~%!$&'()*+,;=:@]+)*/?)?\n" +
        " )\n" +
        "|# Relative URL (no scheme or authority)\n" +
        " ([a-z0-9\\-._~%!$&'()*+,;=@]+(/[a-z0-9\\-._~%!$&'()*+,;=:@]+)*/?  # Relative path\n" +
        " |(/[a-z0-9\\-._~%!$&'()*+,;=:@]+)+/?)                            # Absolute path\n" +
        ")\n" +
        "# Query\n" +
        "(\\?[a-z0-9\\-._~%!$&'()*+,;=:@/?]*)?\n" +
        "# Fragment\n" +
        "(\\#[a-z0-9\\-._~%!$&'()*+,;=:@/?]*)?\n" +
        "$");
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
}
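
To pull URLs out of each line of HTML rather than test a whole string, a smaller pattern used with Matcher.find() in a loop may be enough. The following is only a minimal sketch under some assumptions: it handles just the com, net and org extensions from the question, it treats www as one more host label (so it is optional without special handling), it stops the path at whitespace, quotes, or angle brackets so trailing HTML is not captured, and it falls back to http when no scheme is present. The names UrlExtractor and extract are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlExtractor {
    // Group 1: optional scheme (http or https)
    // Group 2: host, including "www." if present
    // Group 3: the extension that actually matched (com, net, or org)
    // Group 4: optional path, ending at whitespace, a quote, or an angle bracket
    private static final Pattern URL_PATTERN = Pattern.compile(
        "(?:(https?)://)?((?:[a-z0-9-]+\\.)+(com|net|org))\\b(/[^\\s\"'<>]*)?",
        Pattern.CASE_INSENSITIVE);

    // Returns every URL-like match found in one line of text.
    public static List<String> extract(String line) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(line);
        while (m.find()) { // find() in a loop picks up several URLs per line
            // Assumption: default to http when the link has no scheme
            String scheme = m.group(1) != null ? m.group(1) : "http";
            String host = m.group(2);
            String path = m.group(4) != null ? m.group(4) : "";
            urls.add(scheme + "://" + host + path);
        }
        return urls;
    }
}
```

Calling UrlExtractor.extract(line) inside the question's readLine() loop would replace the single find() call. Because the host group already contains the matched extension, the URL can be rebuilt without hard-coding ".com".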
Thiago Souza

Use a library. I used HtmlCleaner, and it does the job.

You can find it at: http://htmlcleaner.sourceforge.net/javause.php

Another example (not tested), with jsoup: http://jsoup.org/cookbook/extracting-data/example-list-links

It is rather readable.

You can enhance it to select <a> tags or other tags, the href attribute or others, and so on.

Handling mixed-case attribute names (HreF, HRef, ...) more precisely is left as an exercise.

import java.util.Map;
import java.util.Vector;

import org.htmlcleaner.*;

// Collects the href value of every <a> element in the given HTML source.
public static Vector<String> HTML2URLS(String _source)
{
    Vector<String> result = new Vector<String>();

    HtmlCleaner cleaner = new HtmlCleaner();

    // Root node of the cleaned document
    TagNode node = cleaner.clean(_source);

    // All descendant elements (recursive)
    TagNode[] myNodes = node.getAllElements(true);

    for (TagNode tn : myNodes)
    {
        // Only anchor tags are of interest
        if (!"a".equalsIgnoreCase(tn.getName())) continue;

        // All attributes of the tag
        Map<String, String> mss = tn.getAttributes();

        // Is there an href?
        String href = null;
        if (mss.containsKey("href")) href = mss.get("href");
        if (mss.containsKey("HREF")) href = mss.get("HREF");

        // Skip anchors without a usable href instead of adding empty strings
        if (href != null && !href.isEmpty()) result.add(href);
    }
    return result;
}
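
The href values returned by a method like HTML2URLS are often relative (for example /users/1), so a crawler usually has to resolve them against the page's own URL before fetching them. A minimal sketch using only java.net.URI from the standard library follows. The names HrefResolver and toAbsolute are hypothetical, and keeping only http and https links is an assumption about what the crawler should follow.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

public class HrefResolver {
    // Resolves each href against baseUrl and keeps only http/https results.
    public static List<String> toAbsolute(String baseUrl, List<String> hrefs) {
        URI base = URI.create(baseUrl);
        List<String> result = new ArrayList<>();
        for (String href : hrefs) {
            if (href == null || href.trim().isEmpty()) continue;
            try {
                // Absolute hrefs come back unchanged; relative ones are resolved
                URI resolved = base.resolve(href.trim());
                String scheme = resolved.getScheme();
                // Assumption: skip mailto:, javascript:, and other non-web schemes
                if ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme)) {
                    result.add(resolved.toString());
                }
            } catch (IllegalArgumentException ex) {
                // Malformed href (e.g. containing spaces): skip it
            }
        }
        return result;
    }
}
```

With this, the list of hrefs extracted from a page can be turned into absolute URLs that are safe to pass to new URL(...) in the next crawl step.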