mcity=re.search(r"([\u0000-\uFFFFA-Za-z\s]+)", city, re.U)
Unlike \x
, \u
is not a special sequence in regex syntax, so your character group matches a literal backslash, letter U, and so on.
To refer to non-ASCII in a regex you have to include them as raw characters in a Unicode string, for example as:
mcity=re.search(u"([\u0000-\uFFFFA-Za-z\\s]+)", city, re.U)
(If you don't want to double-backslash the \s
, you could also use a ur
string, in which \u
still works as an escape but the other escapes like \x
don't. This is a bit confusing though.)
This character group is redundant: including the range U+0000 to U+FFFF already covers all of A-Za-z\s
, and indeed the whole Basic Multilingual Plane including control characters. On a narrow build of Python (including Windows Python 2 builds), where the characters outside the BMP are represented using surrogate pairs in the range U+D800 to U+DFFF, you are actually allowing every single character, so it's not much of a filter. (.+
would be a simpler way of putting it.)
Then again it's pretty difficult to express what might constitute a valid town name in different parts of the world. I'd be tempted to accept anything that, shorn of control characters and leading/trailing whitespace, wasn't an empty string.