
I was experimenting with regular expressions in Python 2.7.3 and came across behavior I did not expect. The function below checks its argument against a pattern that excludes only "ß", yet it returns False for "ø".

It also returns False for other accented characters such as "å", "Å", "ç", "Ç", "Â" and "Í".

At the same time, it has no problem with other non-ASCII characters such as "¥". All of these characters have different Unicode/UTF-8 values (UTF-8 is what my encoding is set to), so I'm not sure where the difference lies.

# -*- coding: utf-8 -*-
import re

def regex_check(name):
    pattern = '[^ß]'
    if re.match(pattern, str(name), re.IGNORECASE):
        return True
    else:
        return False

print regex_check("ø")  # prints False

Am I missing something obvious? Thanks for the help.

Friendly King

1 Answer


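Plain string literals are byte strings in Python 2, so '[^ß]' is compiled as a character class that excludes each of the two UTF-8 bytes of "ß" (0xC3 and 0x9F) separately. The accented letters you listed ("ø", "å", "Ç", and so on) all begin with the byte 0xC3 in UTF-8, so the match fails on the first byte, while "¥" begins with 0xC2 and passes. Use the u'...' prefix so that both the pattern and the input are treated as Unicode strings.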

# -*- coding: utf-8 -*-
import re

def regex_check(name):
    pattern = u'[^ß]'  # use u'...' here
    if re.match(pattern, name, re.IGNORECASE):
        return True
    else:
        return False

print regex_check(u"ø")  # use u'...' here

output:

True
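
To see where the difference between "ø" and "¥" comes from, here is a small sketch that runs the same check twice: once on UTF-8 bytes (which is what a plain '...' literal holds in Python 2) and once on Unicode strings. The helper name first_char_allowed and the results dictionary are made up for this illustration. The code is written so that it should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import re

def first_char_allowed(pattern, text):
    """Return True if re.match finds a match at the start of text."""
    return re.match(pattern, text, re.IGNORECASE) is not None

# The Unicode pattern [^ß], and the same pattern encoded as UTF-8 bytes.
# The byte version is a class that excludes the bytes 0xC3 and 0x9F.
unicode_pattern = u"[^\u00df]"
byte_pattern = unicode_pattern.encode("utf-8")

results = {}
for ch in (u"\u00f8", u"\u00c5", u"\u00a5"):  # ø, Å, ¥
    encoded = ch.encode("utf-8")
    results[ch] = (
        first_char_allowed(byte_pattern, encoded),  # byte-string behavior
        first_char_allowed(unicode_pattern, ch),    # unicode behavior
    )
    print(repr(encoded), results[ch])
```

For "ø" and "Å" the byte-level check is False because their encodings start with 0xC3, which the byte pattern excludes; for "¥" it is True because its encoding starts with 0xC2. The Unicode check is True for all three, since none of them is "ß".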
Ashwini Chaudhary