
There are many libraries (e.g. Jsoup) that can do this task in one go, but how can I get all the URLs present in the HTML content of a website using Java, without using any external libraries?

Edit 1: Can anyone explain what scanner.useDelimiter("\\Z") actually does, and what the difference is between scanner.useDelimiter("\\Z") and scanner.useDelimiter("\\z")?

Abhinav
    Maybe useless for you but useful for someone who may have just started. – Abhinav Nov 23 '19 at 19:43
  • Welcome to SO! @AbhinavMaurya, Jens's point is that this was an exceptionally broad question that would be difficult to answer in a useful way in the SO format. See [how to ask](http://www.stackoverflow.com/help/how-to-ask) a good question. – robsiemb Nov 23 '19 at 19:58

2 Answers


I am answering my own question, as I was trying to find an accurate answer on Stack Overflow but couldn't find one.

Here is the code:

import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

ArrayList<String> finalResult = new ArrayList<>();

try {
    String content = null;
    try {
        URLConnection connection = new URL("https://yahoo.com").openConnection();
        Scanner scanner = new Scanner(connection.getInputStream());
        scanner.useDelimiter("\\Z"); // read (almost) the whole stream as one token
        content = scanner.next();
        scanner.close();
    } catch (Exception ex) {
        ex.printStackTrace();
    }

    if (content != null) { // skip matching if the download failed
        String regex = "(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(content);
        while (m.find()) {
            if (!finalResult.contains(m.group())) {
                finalResult.add(m.group());
            }
        }
    }
} finally {
    for (String res : finalResult) {
        System.out.println(res);
    }
}
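On the Edit 1 question: in a Java regex, \Z matches at the end of the input but before a final line terminator, while \z matches only at the absolute end of the input. A minimal sketch of the difference, using a made-up input string:

```java
import java.util.Scanner;

public class DelimiterDemo {
    public static void main(String[] args) {
        String input = "line one\nline two\n";

        // "\\Z" matches just before the final line terminator,
        // so the trailing \n is excluded from the token.
        Scanner bigZ = new Scanner(input).useDelimiter("\\Z");
        System.out.println(bigZ.next());

        // "\\z" matches only at the absolute end of the input,
        // so the trailing \n is included in the token.
        Scanner smallZ = new Scanner(input).useDelimiter("\\z");
        System.out.println(smallZ.next());
    }
}
```

For grabbing a page body as above, either delimiter works; "\\Z" simply stops before a trailing newline, if there is one.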
Abhinav

You can try using a regex. Here is an example of a regex that checks whether a given piece of text is a URL: https://www.regextester.com/96504.
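As a sketch of this idea with java.util.regex (using a deliberately simplified pattern, not the more thorough one from the linked page):

```java
import java.util.regex.Pattern;

public class UrlCheck {
    public static void main(String[] args) {
        // Simplified illustrative pattern: scheme followed by non-whitespace.
        Pattern urlPattern = Pattern.compile("(https?|ftp|file)://\\S+");

        // matches() tests whether the entire string is a URL.
        System.out.println(urlPattern.matcher("https://example.com/page").matches()); // true
        System.out.println(urlPattern.matcher("not a url").matches());                // false
    }
}
```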

But I can't stop myself from saying that Jsoup is what fits this task best, even though it's an external library.

Mohamed