python unicode regex

Question

I'd like to replace the regex below with a Unicode-friendly version that will catch things like http://➡.ws and other non-ASCII IRIs. The purpose is to pull these out of users' text, encode them, and turn them into real HTML links.

Python provides a re.UNICODE flag which changes the meaning of \w, but that's not super helpful in this case (that I can see) because it is defined as "alphanumeric characters and underscore" and not all of my below character classes include underscore.

import re

domain_regex = re.compile(r"""
    (
        (https?://)
        (
            [0-9a-zA-Z]
            [0-9a-zA-Z_-]*
            \.
        )+
        [a-zA-Z]{2,4}
    )
    | # begins with an http scheme followed by a domain, or
    (
        (?<!   # negative look-behind
            [0-9a-zA-Z.@-]
        )
        (
            [0-9a-zA-Z]
            [0-9a-zA-Z_-]*
            \.
        )+
        # top-level domain names (grouped so the alternation stays inside this branch)
        (?:
            com|ca|net|org|edu|gov|biz|info|mobi|name|
            us|uk|fr|au|be|ch|de|es|eu|it|tv|cn|jp
        )
    )
""", re.VERBOSE)

More non-ascii domains:

Bücher.ch -- (Swiss German for "books". Currently down.)
http://παράδειγμα.δοκιμή
http://실례.테스트

This is a possible duplicate of http://stackoverflow.com/questions/1832893/python-regex-matching-unicode-properties. Let us know if you need any more help. - buckley, Mar 22 '12 at 22:08

score 5 · Answer 1 · answered Mar 22 '12 at 22:05

If you want to write "\w except underscore" you can do so using a negated character class:

[^\W_]
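As a rough sketch of how this class could slot into the "scheme followed by a domain" branch of the question's pattern: the version below assumes Python 3, where str patterns are already Unicode-aware (re.UNICODE is passed only to mirror the question), and it drops the 4-letter cap on the TLD, which is a choice made here and not part of the original regex.

```python
import re

# Assumption: Python 3. re.UNICODE is redundant for str patterns there,
# but Python 2 needs it for \w and \W to cover non-ASCII characters.
domain_regex = re.compile(
    r"""
    https?://
    (?:
        [^\W_]      # label starts with a Unicode letter or digit
        [\w-]*      # then letters, digits, underscore or hyphen
        \.
    )+
    [^\W\d_]{2,}    # TLD: letters only (length cap removed in this sketch)
    """,
    re.VERBOSE | re.UNICODE,
)

text = "see http://παράδειγμα.δοκιμή and http://실례.테스트 or http://➡.ws"
matches = [m.group(0) for m in domain_regex.finditer(text)]
print(matches)
```

Note that http://➡.ws is not matched: ➡ is a symbol, not a letter or digit, so neither \w nor [^\W_] covers it.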

Mark Byers

score 0 · Answer 2 · edited May 23 '17 at 10:29

As buckley noted, "Python regex matching Unicode properties" presents some alternatives for combining regexes and Unicode in Python. If all you want is alphanumerics, alphanumerics plus underscore, or letters only, it is probably easier to stick with Mark Byers' suggestion: [^\W_], \w and [^\W\d_] respectively, with re.UNICODE active.
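To make the difference between the three classes concrete, here is a small comparison on one sample string. The sample text and variable names are made up for illustration.

```python
import re

sample = "실례_테스트42"

# Alphanumeric only: the underscore splits the string into two runs.
alnum = re.findall(r"[^\W_]+", sample, re.UNICODE)

# Alphanumeric plus underscore: the whole string is one run.
word = re.findall(r"\w+", sample, re.UNICODE)

# Letters only: both the underscore and the digits are excluded.
letters = re.findall(r"[^\W\d_]+", sample, re.UNICODE)

print(alnum)    # ['실례', '테스트42']
print(word)     # ['실례_테스트42']
print(letters)  # ['실례', '테스트']
```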

Otherwise, look up which character classes are valid in an IRI and either use a regex engine that supports Unicode character classes or, if you need a pure-Python solution, use the code I provided in an answer to that question (or a similar solution).
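One possible "similar solution" in pure Python is to build a character class from Unicode general categories with the standard unicodedata module. This is only a sketch and not the code from the linked answer: the helper name char_class is hypothetical, and the choice of categories (letters, numbers, and "other symbols" so that ➡ is covered) is an assumption, not a statement of what IRIs actually allow.

```python
import re
import sys
import unicodedata


def char_class(prefixes):
    """Build a regex character class for every code point whose Unicode
    general category starts with one of `prefixes` (a tuple of strings)."""
    parts = []
    start = None
    # One extra iteration past sys.maxunicode flushes a run that reaches the end.
    for cp in range(sys.maxunicode + 2):
        inside = (cp <= sys.maxunicode
                  and unicodedata.category(chr(cp)).startswith(prefixes))
        if inside and start is None:
            start = cp
        elif not inside and start is not None:
            parts.append("\\U%08x-\\U%08x" % (start, cp - 1))
            start = None
    return "[" + "".join(parts) + "]"


# Assumed categories: L = letters, N = numbers, So = other symbols.
LABEL_CHAR = char_class(("L", "N", "So"))
LETTER = char_class(("L",))

iri_regex = re.compile(
    r"https?://(?:%s(?:%s|[_-])*\.)+%s{2,}" % (LABEL_CHAR, LABEL_CHAR, LETTER)
)

found = iri_regex.search("go to http://➡.ws now")
print(found.group(0) if found else None)
```

Each call to char_class scans the full code point range, so it is best done once at import time. A regex engine with native \p{...} support would avoid the scan entirely.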
