I have a city name as a Unicode string and I want to match it with a regex, but the same regex should also accept plain ASCII names such as "New York". I searched around and tried the code below, but could not get it to work.

I tried the regex "([\u0000-\uFFFF]+)" on http://regex101.com/#python and it works there, but I could not get it working in Python.

Thanks in advance!!

>>> import re
>>> city = u"H\u0101na"
>>> mcity = re.search(r"([\u0000-\uFFFFA-Za-z\s]+)", city, re.U)
>>> mcity.group(0)
u'H'
K DawG
amstree
  • Your regex must itself be a unicode string: `ur"..."`. But that isn't a good regex. – georg Dec 30 '13 at 17:44
  • possible duplicate of [How do I specify a range of unicode characters](http://stackoverflow.com/questions/3835917/how-do-i-specify-a-range-of-unicode-characters) – Iguananaut Dec 30 '13 at 17:45
  • Also, specifying such a wide range of characters means that your regex will match just about any character it's likely to encounter, including all ASCII characters. Depending on what exactly you're trying to match you might need to use a more restricted range of characters. – Iguananaut Dec 30 '13 at 17:49

1 Answer

mcity=re.search(r"([\u0000-\uFFFFA-Za-z\s]+)", city, re.U)

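Unlike \x, \u is not a special sequence in Python 2's regex syntax, so in a raw string your character group is built from the literal characters of the pattern text rather than from the code points U+0000 and U+FFFF. (Python 3.3 and later do recognise \u inside a pattern.)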

To refer to non-ASCII characters in a regex, you have to include them as actual characters in a Unicode string, for example:

mcity=re.search(u"([\u0000-\uFFFFA-Za-z\\s]+)", city, re.U)

(If you don't want to double the backslash in \s, you could also use a ur string in Python 2, in which \u still works as an escape but other escapes such as \x don't. This is a bit confusing, though.)
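
As a quick check, here is a minimal sketch of the non-raw Unicode pattern applied to the city from the question. The variable names `pattern` and `matched` are only for illustration; the `u"..."` prefix is accepted by Python 2 and by Python 3.3 and later.

```python
import re

city = u"H\u0101na"

# Non-raw Unicode literal: Python itself turns \u0000 and \uFFFF into real
# characters before the regex engine sees the pattern, so \s has to be
# written as \\s to reach the engine as backslash + s.
pattern = u"([\u0000-\uFFFFA-Za-z\\s]+)"

mcity = re.search(pattern, city, re.U)
matched = mcity.group(0)
print(repr(matched))
```

With the range expressed as real characters, the match should cover the whole name rather than stopping at the first non-ASCII letter.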

This character group is redundant: including the range U+0000 to U+FFFF already covers all of A-Za-z\s, and indeed the whole Basic Multilingual Plane including control characters. On a narrow build of Python (including Windows Python 2 builds), where the characters outside the BMP are represented using surrogate pairs in the range U+D800 to U+DFFF, you are actually allowing every single character, so it's not much of a filter. (.+ would be a simpler way of putting it.)
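
To make the redundancy concrete, the sketch below anchors the same character group and runs it over a few made-up inputs. The names `wide`, `samples` and `astral` are hypothetical, and the last check assumes Python 3.3 or later (or a wide Python 2 build), where a non-BMP character is a single code point.

```python
import re

# Same character group as above, anchored so the whole string must match.
wide = re.compile(u"^[\u0000-\uFFFFA-Za-z\\s]+$", re.U)

# City names, digits, control characters and punctuation are all inside
# U+0000..U+FFFF, so every one of these is accepted.
samples = [u"New York", u"H\u0101na", u"12345", u"\x07\x1b", u"!!!"]
results = [bool(wide.match(s)) for s in samples]
print(results)

# Assumption: the interpreter stores non-BMP characters as one code point.
# There, U+1F600 lies outside the range and is rejected; on a narrow build
# its two surrogates would fall inside the range and it would be accepted.
astral = bool(wide.match(u"\U0001F600"))
print(astral)
```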

Then again it's pretty difficult to express what might constitute a valid town name in different parts of the world. I'd be tempted to accept anything that, shorn of control characters and leading/trailing whitespace, wasn't an empty string.
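
One possible reading of that last suggestion, as a rough sketch rather than a definitive rule: `is_plausible_city_name` is a hypothetical helper, and treating "control characters" as Unicode general category Cc is an assumption.

```python
import unicodedata


def is_plausible_city_name(name):
    """Return True if anything is left after cleaning up the input."""
    # Assumption: "control characters" means general category Cc,
    # which also covers tab, newline and carriage return.
    cleaned = u"".join(
        ch for ch in name if unicodedata.category(ch) != "Cc"
    )
    # strip() removes leading and trailing whitespace.
    return cleaned.strip() != u""


print(is_plausible_city_name(u"New York"))
print(is_plausible_city_name(u"H\u0101na"))
print(is_plausible_city_name(u"  \t\n"))
```

This accepts names in any script and rejects only input that is empty once control characters and surrounding whitespace are gone.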

bobince