I want a regex to match web addresses such as http://www.example.com, example.co.uk, en.example.com etc. I've been using ^(https?://|www\.|)[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$ and testing it on http://regexpal.com/, where it seems to work exactly as it should.

However, when I put it in AutoHotkey, it matches extra things like example and example.something, which it shouldn't. It also fails to match things like example.com/something and example.com/something.html, which it should.

If RegExMatch(Clipboard, "^(https?://|www\.|)[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$")
    Msgbox, it matches
else
    Msgbox, it doesn't
Sam K.

1 Answer

Matching URLs, host names etc. is a problem that has been solved many times; I suggest you adapt a standard regex. Perhaps the SO question "Fully qualified domain name validation" is helpful.


If you're composing the regex as an exercise:

Does it really match the string example? You firmly assert the string to contain a ., so it never should. Maybe AHK doesn't escape . the standard way?

If [a-zA-Z]{2,3} is meant to match the top-level domain, you forgot about longer ones such as .info.
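
To illustrate the top-level-domain point, here is a minimal sketch that uses Python's re module as a stand-in for AutoHotkey's PCRE engine (an assumption; a pattern this simple should behave the same in both). The names STRICT, LOOSE and is_url are hypothetical, and {2,} is only one possible relaxation, not a recommended final answer.

```python
import re

# The pattern from the question: TLD limited to 2 or 3 letters.
STRICT = re.compile(r"^(https?://|www\.|)[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$")

# Hypothetical relaxation (assumption): any TLD of 2 or more letters.
LOOSE = re.compile(r"^(https?://|www\.|)[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,}(/\S*)?$")


def is_url(pattern, text):
    """Return True if the compiled pattern matches the whole text."""
    return pattern.search(text) is not None


if __name__ == "__main__":
    for candidate in ("example.com", "example.info"):
        print(candidate, is_url(STRICT, candidate), is_url(LOOSE, candidate))
```

Note that the relaxed version also accepts example.something, since nothing checks that the top-level domain actually exists.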

You may want to allow whitespace of arbitrary length at the beginning and end, in case you accidentally copied some into the clipboard, i.e. ^\s*your-regex-thingy\s*$
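
The whitespace-tolerant form could be sketched like this, again with Python's re module standing in for AutoHotkey's engine (an assumption). CORE, PADDED and matches_padded are hypothetical names; the non-capturing group is there so that the empty alternative inside the original pattern cannot interact with the surrounding \s*.

```python
import re

# The question's pattern without its anchors.
CORE = r"(https?://|www\.|)[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?"

# Same pattern, tolerating leading and trailing whitespace.
PADDED = re.compile(r"^\s*(?:" + CORE + r")\s*$")


def matches_padded(text):
    """Return True if text is a URL optionally surrounded by whitespace."""
    return PADDED.search(text) is not None


if __name__ == "__main__":
    print(matches_padded("  example.com/page \n"))
    print(matches_padded("see example.com"))
```

Text other than whitespace around the address is still rejected, so this covers stray spaces and newlines only.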

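example.something should not match the regex as written either: it begins with the empty string, follows with a sequence of 1 or more alphanumerics (or -, .), one ., and 2 or 3 letters, but whatever comes next must either start with / or be the end of the string. It would match only if the trailing group were (\S*)? without the leading /, which again points to an escaping problem.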

example.com/something.html might fail to match if the entire substring example.com is consumed by the group [a-zA-Z0-9\-\.]+. It shouldn't if the regex engine is correctly implemented, though, because the engine backtracks. Perhaps you need to escape +, | or some such; engines have varying conventions on this (e.g. sed and PCRE have differing opinions on + and (, if I'm not mistaken).
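
As a quick reference for what a standard backtracking engine does with the inputs from the question, here is a small check in Python (used as a stand-in for AutoHotkey's PCRE engine, which is an assumption). PATTERN, check and results are hypothetical names. If AutoHotkey disagrees with these results, the pattern string is probably being altered before it reaches the regex engine.

```python
import re

# The pattern from the question, unchanged.
PATTERN = re.compile(r"^(https?://|www\.|)[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$")


def check(samples):
    """Map each sample string to True/False depending on whether it matches."""
    return {s: PATTERN.search(s) is not None for s in samples}


results = check([
    "http://www.example.com",
    "example.co.uk",
    "en.example.com",
    "example.com/something",
    "example.com/something.html",
    "example",
    "example.something",
])

if __name__ == "__main__":
    for sample, matched in results.items():
        print(f"{sample!r}: {matched}")
```

Here the greedy character class first swallows example.com, then gives back characters until \.[a-zA-Z]{2,3} can match .com, so the paths after the slash are accepted.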

sapht
  • I would even go as far as allowing not just whitespace, but other *noise* surrounding the URL in the clipboard. In other words, I would simply remove the `^` and `$`. You never know what browsers or word processors actually do when you copy stuff, especially if they are from Microsoft ;) For instance, JavaScript can *"hijack"* your clipboard; here's an [example](http://www.firstpost.com/politics/volunteers-or-vigilantes-the-perils-of-aaps-anarchic-politics-1328297.html). Try to copy something from the news and paste it somewhere. – MCL Feb 15 '14 at 15:08
  • Thanks! Turns out it *was* an escaping problem. I'd changed the escape character to /, so I needed two of them to get the regex to work properly. – Sam K. Feb 15 '14 at 17:12