Python Regex strange behavior with accented characters

Question

I was experimenting with regular expressions in Python 2.7.3 and came across behavior I did not expect. The function in the block of code below returns False when checking against the "ß" character, and the same happens with other accented characters like "Å", "Í", etc.

Besides returning False for the "ø" character, it also returns False for other accented characters such as "å", "Å", "ç", "Ç", "Â", etc.

Case in point, I'm not sure where the problem stems from when dealing with accented characters versus other characters like "¥", which it has no problem with. They all have different Unicode/UTF-8 values (UTF-8 is what my encoding is set to), so I'm not sure where the difference lies.

# -*- coding: utf-8 -*-
import re

def regex_check(name):
    pattern = '[^ß]'
    if re.match(pattern, str(name), re.IGNORECASE):
        return True
    else:
        return False

print regex_check("ø")

Am I missing something obvious? Thanks for the help.

Ashwini Chaudhary · Accepted Answer · 2015-12-27T09:04:15.147

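In Python 2, normal string literals are byte strings, so in a UTF-8 source file '[^ß]' is a character class over the two bytes that encode "ß" rather than over the character itself. Use the u'...' prefix for both the pattern and the input so that they are treated as Unicode strings.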

# -*- coding: utf-8 -*-
import re

def regex_check(name):
    pattern = u'[^ß]'  # use u'...' here
    if re.match(pattern, name, re.IGNORECASE):
        return True
    else:
        return False

print regex_check(u"ø")  # use u'...' here

output:

True
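
To see why the byte-string version fails for "ø" but not for "¥", here is a minimal sketch that reproduces the byte-level match explicitly. The helper name byte_level_match is hypothetical (not from the question), and the sketch assumes a UTF-8 source file. It encodes both sides to UTF-8 by hand, which is roughly what plain '...' literals hold in Python 2:

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical helper for illustration: encode pattern and subject to UTF-8
# bytes, mimicking what plain '...' literals contain in a UTF-8 Python 2 file.
def byte_level_match(pattern_text, subject_text):
    pattern_bytes = pattern_text.encode('utf-8')  # u'[^ß]' -> [^ + 0xC3 0x9F + ]
    subject_bytes = subject_text.encode('utf-8')
    # re.match only tries position 0, i.e. the first byte of the subject.
    return re.match(pattern_bytes, subject_bytes) is not None

# "ø" encodes as 0xC3 0xB8: its first byte 0xC3 is one of the excluded bytes.
print(byte_level_match(u'[^ß]', u'ø'))  # False

# "¥" encodes as 0xC2 0xA5: its first byte 0xC2 is not excluded.
print(byte_level_match(u'[^ß]', u'¥'))  # True
```

Every character from U+00C0 to U+00FF starts with the byte 0xC3 in UTF-8, which is why "å", "Å", "ç" and the other accented letters all fail the same way.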

edited Dec 27 '15 at 09:04

answered Sep 07 '13 at 18:00


The re.UNICODE flag shouldn't have any effect here. It should return True regardless. – nhahtdh Sep 07 '13 at 18:09
@nhahtdh You're right, proper file encoding and using u'...' will handle everything. – Ashwini Chaudhary Sep 07 '13 at 18:11
I think you should explain that in your answer (rather than just code). – nhahtdh Sep 07 '13 at 18:12
@AshwiniChaudhary: Shouldn't it give a decoding error instead of failed regex if file encoding is absent or file encoding is ascii? I think the issue is just that the string should be declared as unicode. – Abhijit Sep 07 '13 at 18:15
@Abhijit OP has already set their encoding to `utf-8`, so yes, the problem was only the missing `u'...'` prefix. – Ashwini Chaudhary Sep 07 '13 at 18:17
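
The point raised in the comments about re.UNICODE can be checked with a short sketch, assuming a UTF-8 source file. The variable names are only illustrative. Once the pattern and the input are both Unicode strings, the match should succeed with or without the flag:

```python
# -*- coding: utf-8 -*-
import re

pattern = u'[^ß]'

# Same Unicode pattern and input, with and without re.UNICODE.
without_flag = re.match(pattern, u'ø', re.IGNORECASE)
with_flag = re.match(pattern, u'ø', re.IGNORECASE | re.UNICODE)

print(without_flag is not None)  # True
print(with_flag is not None)     # True

# The excluded character itself is still rejected.
print(re.match(pattern, u'ß', re.IGNORECASE) is not None)  # False
```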
