
There are many libraries (e.g. Jsoup) that can do this task in one go, but how can I get all the URLs present in the HTML content of a website using Java, without using any external libraries?

Edit 1: Can anyone explain what scanner.useDelimiter("\\Z") actually does, and what the difference is between scanner.useDelimiter("\\Z") and scanner.useDelimiter("\\z")?

Abhinav
    Maybe useless for you but useful for someone who may have just started. – Abhinav Nov 23 '19 at 19:43
  • Welcome to SO! @AbhinavMaurya, Jens's point is that this was an exceptionally broad question that would be difficult to answer in a useful way in the SO format. See [how to ask](http://www.stackoverflow.com/help/how-to-ask) a good question. – robsiemb Nov 23 '19 at 19:58

2 Answers


I am answering my own question, as I was trying to find an accurate answer on Stack Overflow but couldn't find one.

Here is the code:

import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

ArrayList<String> finalResult = new ArrayList<>();

try {
    String content = null;
    try {
        URLConnection connection = new URL("https://yahoo.com").openConnection();
        Scanner scanner = new Scanner(connection.getInputStream());
        scanner.useDelimiter("\\Z"); // read (almost) the whole stream as one token
        content = scanner.next();
        scanner.close();
    } catch (Exception ex) {
        ex.printStackTrace();
    }

    if (content != null) { // skip matching if the download failed
        String regex = "(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(content);
        while (m.find()) {
            if (!finalResult.contains(m.group())) {
                finalResult.add(m.group());
            }
        }
    }
} finally {
    for (String res : finalResult) {
        System.out.println(res);
    }
}
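On the Edit 1 question: in a Java regex, \Z matches at the end of the input but before a final line terminator, while \z matches only at the absolute end of the input. A minimal sketch of the difference, using a made-up input string:

```java
import java.util.Scanner;

public class DelimiterDemo {
    public static void main(String[] args) {
        String input = "line one\nline two\n";

        // "\\Z" matches just before the final line terminator,
        // so the trailing \n is excluded from the token.
        Scanner bigZ = new Scanner(input).useDelimiter("\\Z");
        System.out.println(bigZ.next());

        // "\\z" matches only at the absolute end of the input,
        // so the trailing \n is included in the token.
        Scanner smallZ = new Scanner(input).useDelimiter("\\z");
        System.out.println(smallZ.next());
    }
}
```

For grabbing a page body as above, either delimiter works; "\\Z" simply stops before a trailing newline, if there is one.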
Abhinav

You can try using a regex. Here is an example of a regex that checks whether a given piece of text is a URL: https://www.regextester.com/96504.
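As a sketch of this idea with java.util.regex (using a deliberately simplified pattern, not the more thorough one from the linked page):

```java
import java.util.regex.Pattern;

public class UrlCheck {
    public static void main(String[] args) {
        // Simplified illustrative pattern: scheme followed by non-whitespace.
        Pattern urlPattern = Pattern.compile("(https?|ftp|file)://\\S+");

        // matches() tests whether the entire string is a URL.
        System.out.println(urlPattern.matcher("https://example.com/page").matches()); // true
        System.out.println(urlPattern.matcher("not a url").matches());                // false
    }
}
```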

But I can't stop myself from saying that Jsoup is what fits this task best, even though it's an external library.

Mohamed