I'd like to replace the below regex with a unicode-friendly version that will catch things like http://➡.ws and other non-ascii IRIs. The purpose is to grab these out of users' text and encode and html-ize them into real links.
Python provides a re.UNICODE flag which changes the meaning of \w, but that's not super helpful in this case (that I can see) because it is defined as "alphanumeric characters and underscore" and not all of my below character classes include underscore.
domain_regex = re.compile(r"""
(
(https?://)
(
[0-9a-zA-Z]
[0-9a-zA-Z_-]*
\.
)+
[a-zA-Z]{2,4}
)
| # begins with an http scheme followed by a domain, or
(
(?<! # negative look-behind
[0-9a-zA-Z.@-]
)
(
[0-9a-zA-Z]
[0-9a-zA-Z_-]*
\.
)+
# top-level domain names
com|ca|net|org|edu|gov|biz|info|mobi|name|
us|uk|fr|au|be|ch|de|es|eu|it|tv|cn|jp
)
""", re.VERBOSE)
More non-ascii domains:
- Bücher.ch -- (swiss-german "books". Currently down.)
- http://παράδειγμα.δοκιμή
- http://실례.테스트