Issue with URL validation in Java

Question

I'm trying to validate a URL. This is for a project where the user could enter any kind of weird URL, for example:

pickup.calendar.com/schedu/e-mail/pickup.php/123wec/99245t29-03882-4ttr-345nvnwikg3545/?
mypage-services.info
https://google.com
http://mypersonaldomain.com.org
mypersonaldomain2.com.org/services/123
https%3A%2F%2Mycoolstuff.com
wikipedia.org
www.github.com
https://testing.com/@myurl/about

I have tried using new URI(url) from java.net, but it doesn't validate correctly: values like +1.1234566, a plain number like 123337000, or a single word like test are accepted, but they shouldn't be.

I have also tried the UrlValidator from Apache Commons, but it reports a URL like mypage-services.info as invalid.

The last thing I've tried is a regex. This is what I've got so far:

^(http:\\/\\/|https:\\/\\/)?(www.)?([a-zA-Z0-9]+).[a-zA-Z0-9]*.[a-z]{3}.?([a-z]+)?$\n

But it doesn't work for URL patterns like mypage-services.info.

I know this looks like a duplicate question, but I have tried all of the regexes I found in similar questions and none of them worked for my requirement.

I would appreciate any help on this a lot. Thank you.

This post has a bunch of regular expressions you might be able to use: https://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-string/41242257 Full disclosure: one of the answers is mine. — Squazz, Feb 24 '23 at 09:40
Does this answer your question? [Regular expression to find URLs within a string](https://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-string) — Squazz, Feb 24 '23 at 09:41

score 1 · Answer 1 · answered Feb 21 '23 at 00:33

If you want to match all of your examples, you are not actually validating URLs. The Apache Commons UrlValidator you mention validates all of the above correctly: it rejects inputs such as mypage-services.info because they have no scheme.
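To make the distinction concrete without any extra library, here is a minimal sketch using only java.net.URI. It assumes "valid" means an absolute http or https URL with a host; the class and method names (StrictUrlCheck, isAbsoluteHttpUrl) are made up for illustration. Inputs such as test or 123337000 get through new URI(...) because they parse as relative references, so the sketch additionally checks the scheme and host.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class StrictUrlCheck {

    // Hypothetical helper: true only for an absolute http(s) URL that has a host.
    static boolean isAbsoluteHttpUrl(String input) {
        try {
            URI uri = new URI(input);
            String scheme = uri.getScheme(); // null for relative references like "test"
            boolean httpLike = "http".equalsIgnoreCase(scheme)
                    || "https".equalsIgnoreCase(scheme);
            return httpLike && uri.getHost() != null;
        } catch (URISyntaxException e) {
            // Not even parseable as a URI reference.
            return false;
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "https://google.com", "wikipedia.org", "test", "123337000", "+1.1234566"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isAbsoluteHttpUrl(s));
        }
    }
}
```

Under this strict definition, scheme-less inputs such as wikipedia.org are rejected just like test, which is the same verdict UrlValidator gives.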

By RFC, a URL must start with a "scheme", for example https:// or mailto:. When you enter wikipedia.org into your browser's address bar, the browser uses heuristics to guess what you mean: the valid URL https://wikipedia.org. All of your examples can be guessed this way, by prepending https://, except for https%3A%2F%2Mycoolstuff.com. That's a tougher case, because percent-encoding is not allowed in the scheme part of a URL. As such, it makes a good example of how quickly the guesswork gets ugly.
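The prepend-https:// guess can be sketched as follows. This is only one possible heuristic, not what any browser actually does. The assumptions are mine: input without "://" is treated as a scheme-less web address, and a plausible host must contain at least one dot. UrlGuesser and guess are hypothetical names.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Optional;

final class UrlGuesser {

    // Hypothetical helper: returns a best-guess absolute URI, or empty if the
    // input does not look like a web address under the assumptions below.
    static Optional<URI> guess(String input) {
        String candidate = input.trim();
        // Assumption: no "://" means the user left the scheme out.
        if (!candidate.contains("://")) {
            candidate = "https://" + candidate;
        }
        try {
            URI uri = new URI(candidate);
            String host = uri.getHost();
            // Assumption: require a host with at least one dot, which rules out
            // bare words and plain numbers.
            if (host == null || !host.contains(".")) {
                return Optional.empty();
            }
            return Optional.of(uri);
        } catch (URISyntaxException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "wikipedia.org", "mypage-services.info", "https://google.com",
            "test", "123337000", "https%3A%2F%2Mycoolstuff.com"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + guess(s).map(URI::toString).orElse("(rejected)"));
        }
    }
}
```

The percent-encoded example is rejected here as well. Accepting it would need a further guess (decode first, then retry), which is exactly the kind of subjective decision that makes this guesswork rather than validation.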

I'm not aware of a common library for that. There probably isn't one, because there is no universally correct way to do it, and at some point it requires subjective decisions.

Here's how Firefox does it. It's a lot of code, and it says up front in a comment (emphasis mine):

Regex used to guess url-like strings. These are not expected to be 100% correct, we accept some user mistypes and we're unlikely to be able to cover 100% of the cases.

This gives more clarity on the matter, thank you for your help. — ypdev19, Feb 21 '23 at 13:51
