I want to validate that given strings are URLs. Matching URLs in text would be nice too, but is not required. I've searched and experimented, but so far I have not found anything that meets these requirements:
Must not accept strings which, when treated as links, pose a security risk. For example,
<a href="javascript:alert(document.cookie)">clickme</a>
is a valid HTML element and does work (it raises an alert) in at least some browsers. I'm concerned that allowing arbitrary schemes (see below) could compromise security (as noted, for example, here: What is the best regular expression to check if a string is a valid URL?).
Must work correctly in JavaScript.
Ideally it would work the same way in Java, since I'm developing in GWT, but this is not strictly necessary.
Must accept URLs which are used in practice, and not only standard-compliant URLs. Specific examples:
a. I want to accept http://fr.wikipedia.org/wiki/Français, which is non-standard because of the non-ASCII character, but is accepted by my reference browsers, IE (7+) and Chrome.
b. I want to accept http://fr.wikipedia.org/wiki/Fran%c3%a7ais, which some validators reject because the percent-encoding hex digits are lowercase (RFC 3986 recommends uppercase but treats both cases as equivalent); it is accepted by IE and Chrome. I guess I could just do a case-insensitive match -- any downside you can think of?
c. I want to accept http://localhost/localpath/servlet#action?param=value, which is non-standard because the fragment part (from '#' to the end) should not include '?' and other chars, but there are apps which generate such URLs and browsers accept them.
d. I want to accept URLs with any scheme/protocol (not just http, https and ftp), because all kinds of apps I integrate with and their users may need to pass such URLs. I can forbid 'javascript:' and allow everything else; if you think this would compromise security please say so.
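To make the requirements above concrete, here is a rough sketch of the kind of check I have in mind, not a finished solution. The function name `looksLikeSafeUrl` and the `BLOCKED_SCHEMES` list are made up for illustration, and the list contains only 'javascript' because that is the one scheme I mentioned; whether that is enough is exactly what I'm unsure about. It only validates a whole string and does not try to find URLs inside text.

```javascript
// Hypothetical blocklist: only the scheme named in the question.
// Whether more schemes belong here is an open question.
const BLOCKED_SCHEMES = ["javascript"];

// Hypothetical helper: returns true if `input` has the form "scheme:rest",
// the scheme is not blocked, and the rest contains no whitespace or control chars.
function looksLikeSafeUrl(input) {
  if (typeof input !== "string") return false;

  // Browsers ignore leading/trailing spaces and control characters, and drop
  // embedded tabs and newlines, so "java\tscript:" must be treated as "javascript:".
  const cleaned = input
    .replace(/^[\u0000-\u0020]+|[\u0000-\u0020]+$/g, "")
    .replace(/[\t\n\r]/g, "");

  // Scheme: a letter followed by letters, digits, "+", "-" or ".", then ":".
  // The "i" flag makes the scheme match case-insensitive.
  const match = /^([a-z][a-z0-9+.\-]*):(.+)$/i.exec(cleaned);
  if (!match) return false;

  const scheme = match[1].toLowerCase();
  if (BLOCKED_SCHEMES.indexOf(scheme) !== -1) return false;

  // Deliberately permissive about the rest: non-ASCII characters, lowercase
  // percent-encoding and "?" inside the fragment are all allowed.
  return !/[\s\u0000-\u001f\u007f]/.test(match[2]);
}

console.log(looksLikeSafeUrl("http://fr.wikipedia.org/wiki/Français"));
console.log(looksLikeSafeUrl("javascript:alert(document.cookie)"));
```

With this sketch, examples a through d should pass and the `javascript:` link should be rejected, but it accepts a lot of strings that are not really URLs, which is why I'm still looking for something better.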
There are a ton of questions on this topic on SO and elsewhere, but I have not found a regex that meets all of my requirements. Examples:
Regex in GWT to match URLs -- Pretty good and simple regex, but doesn't accept non-standard URLs. I can handle the scheme part and the percent-encoding case-sensitivity, but not the other issues.
https://stackoverflow.com/a/190405/96929 -- Giant regex (I wonder whether all the browsers and frameworks I use can handle its size) which appears to be very comprehensive, but it says it conforms to the standard, and I can't make heads or tails of it.
Thanks! :-)