
I'd like to replace the regex below with a Unicode-friendly version that will catch things like http://➡.ws and other non-ASCII IRIs. The purpose is to pull these out of users' text, encode them, and turn them into real HTML links.

Python provides a re.UNICODE flag which changes the meaning of \w, but that's not super helpful in this case (that I can see) because it is defined as "alphanumeric characters and underscore" and not all of my below character classes include underscore.

import re

domain_regex = re.compile(r"""
    (
        (https?://)
        (
            [0-9a-zA-Z]
            [0-9a-zA-Z_-]*
            \.
        )+
        [a-zA-Z]{2,4}
    )
    | # begins with an http scheme followed by a domain, or
    (
        (?<!   # negative look-behind
            [0-9a-zA-Z.@-]
        )
        (
            [0-9a-zA-Z]
            [0-9a-zA-Z_-]*
            \.
        )+
        # top-level domain names (grouped so the alternation stays inside this branch)
        (?:
            com|ca|net|org|edu|gov|biz|info|mobi|name|
            us|uk|fr|au|be|ch|de|es|eu|it|tv|cn|jp
        )
    )
""", re.VERBOSE)

More non-ascii domains:

bukzor
  • This is a possible duplicate of http://stackoverflow.com/questions/1832893/python-regex-matching-unicode-properties Let us know if you need any more help – buckley Mar 22 '12 at 22:08

2 Answers


If you want to write "\w except underscore" you can do so using a negated character class:

[^\W_]
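
As a minimal sketch of how that class behaves (the names alnum and alnum_runs are made up for illustration): in Python 3, str patterns are Unicode-aware by default, so re.UNICODE is redundant there; in Python 2 you would pass the flag and use unicode strings.

```python
import re

# "\w except underscore": letters and digits from any script.
# re.UNICODE is the default for str patterns in Python 3; it is
# spelled out here only to mirror the flag discussed in the question.
alnum = re.compile(r"[^\W_]+", re.UNICODE)

def alnum_runs(text):
    """Return the runs of alphanumeric characters found in text."""
    return alnum.findall(text)

print(alnum_runs("bücher_straße 日本語_42"))
```

With that input the underscores and the space act as separators, so the result is ['bücher', 'straße', '日本語', '42'].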
Mark Byers

As buckley noted, "Python regex matching Unicode properties" presents some alternatives for using regex with Unicode in Python. If all you want is alphanumerics, alphanumerics plus underscore, or letters only, it may be easier to stick with Mark Byers' suggestion: [^\W_], \w, and [^\W\d_] respectively, with re.UNICODE active.
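
To make that concrete, here is a rough sketch of the scheme branch of the question's pattern with the ASCII classes swapped for those three. The names unicode_domain and find_domains are hypothetical, and the {2,4} limit on the last label is simply carried over from the question.

```python
import re

# Hypothetical Unicode-aware variant of the "https?://" branch.
unicode_domain = re.compile(r"""
    https?://
    (?:
        [^\W_]        # one alphanumeric character, any script
        [\w-]*        # alphanumerics, underscore, or hyphen
        \.
    )+
    [^\W\d_]{2,4}     # letters only, standing in for [a-zA-Z]{2,4}
""", re.VERBOSE | re.UNICODE)

def find_domains(text):
    """Return every scheme-prefixed domain found in text."""
    return [m.group(0) for m in unicode_domain.finditer(text)]

print(find_domains("see http://bücher.de and http://➡.ws today"))
```

Note the limitation: the arrow in http://➡.ws is a symbol, not a letter or digit, so \w-based classes do not match it and only http://bücher.de is found here.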

Otherwise, look up which character classes are valid in an IRI and either use a regex engine that supports Unicode character classes or, if you need a pure-Python solution, I'd suggest the code I provided in an answer to that question (or a similar solution).
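
One possible pure-Python variant of that idea, offered as a sketch and not as the code from the linked answer: spell out explicit code point ranges in the character class. The ranges below are the BMP part of ucschar as I recall it from RFC 3987, so check them against the RFC before relying on them; the supplementary planes are left out, and LABEL_CHARS and iri_domain are made-up names.

```python
import re

# ASCII alphanumerics plus the BMP portion of RFC 3987 "ucschar"
# (assumption: ranges quoted from memory, supplementary planes omitted).
LABEL_CHARS = r"0-9A-Za-z\u00a0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef"

# scheme, one or more dot-terminated labels, then a final label of 2+ chars
iri_domain = re.compile(
    r"https?://(?:[{c}][{c}_-]*\.)+[{c}]{{2,}}".format(c=LABEL_CHARS)
)

match = iri_domain.search("visit http://➡.ws/ now")
print(match.group(0) if match else None)
```

Unlike the \w-based classes, this one accepts symbols such as ➡, so the example finds http://➡.ws. It is also fairly permissive (for instance, U+00A0 falls inside the first range), so it may need tightening for real input.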

mgibsonbr