I am trying to crawl URLs in order to extract other URLs from each page. To do so, I read the HTML source of the page line by line, match each line against a pattern, and then extract the part I need, as shown below:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleCrawler {

    static String pattern = "https://www\\.([^&]+)\\.(?:com|net|org)/([^&]+)";
    static Pattern UrlPattern = Pattern.compile(pattern);
    static Matcher UrlMatcher;

    public static void main(String[] args) {
        try {
            URL url = new URL("https://stackoverflow.com/");
            BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()));
            String line;
            while ((line = br.readLine()) != null) {
                UrlMatcher = UrlPattern.matcher(line);
                if (UrlMatcher.find()) {
                    // group(1): the part between "www." and the domain extension
                    // group(2): the part after the "/", without the leading slash
                    String extractedPath = UrlMatcher.group(1);
                    String extractedPath2 = UrlMatcher.group(2);
                    // ".com" is hard-coded here even though the pattern also accepts .net and .org
                    System.out.println("http://www." + extractedPath + ".com" + extractedPath2);
                }
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
However, there are a few issues with it that I would like to address:
- How is it possible to make http, www, or both of them optional? I have encountered many links that come without either or both of these parts, so the regex does not match them (see the sketch after this list).
- According to my code, I capture two groups: the first is the part between www. and the domain extension, and the second is whatever comes after it. This, however, causes two sub-problems:
  2.1 Since the input is HTML code, any HTML tags that come after the URL on the same line are extracted too.
  2.2 In System.out.println("http://www." + extractedPath + ".com" + extractedPath2); I cannot be sure it prints the right URL (regardless of the previous issues), because I do not know which domain extension was actually matched, yet I always append ".com".
- Last but not least, how can I match both http and https as well?
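Here is a rough sketch of the kind of pattern I am considering to cover these points, i.e. making the scheme and www. optional, accepting both http and https, capturing the extension instead of hard-coding ".com", and stopping the path before trailing HTML markup. The class name and the sample line are only for illustration, and I am not sure this is the right approach:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternSketch {

    // Tentative pattern (my own guess, not verified against real pages):
    // - "(?:https?://)?" makes the scheme optional and accepts both http and https
    // - "(?:www\.)?"     makes the "www." part optional
    // - group(1) = domain name, group(2) = extension, group(3) = path (may be absent)
    // - the path stops at quotes, whitespace and angle brackets so trailing HTML tags are not swallowed
    static String pattern =
            "(?:https?://)?(?:www\\.)?([A-Za-z0-9.-]+)\\.(com|net|org)(/[^\"'\\s<>]*)?";
    static Pattern UrlPattern = Pattern.compile(pattern);

    public static void main(String[] args) {
        // A single HTML line used as a stand-in for br.readLine()
        String line = "<a href=\"https://stackoverflow.com/questions\">link</a>";
        Matcher m = UrlPattern.matcher(line);
        while (m.find()) {
            String domain = m.group(1);                          // e.g. "stackoverflow"
            String extension = m.group(2);                       // e.g. "com"
            String path = m.group(3) == null ? "" : m.group(3);  // e.g. "/questions"
            // Rebuild the URL with the extension that was actually matched
            System.out.println("http://www." + domain + "." + extension + path);
        }
    }
}

What I am mainly unsure about is whether bounding the path on quotes, whitespace and angle brackets is enough to keep the surrounding HTML out, and whether the optional scheme and www. will cause false matches on ordinary text in the page.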