Python: How to check if a unicode string contains a cased character?

Question

I'm writing a filter that checks whether a unicode string (decoded from UTF-8) contains no uppercase characters, in any language. It's fine with me if the string doesn't contain any cased character at all.

For example: 'Hello!' will not pass the filter, but "!" should pass the filter, since "!" is not a cased character.

I planned to use the islower() method, but in the example above, "!".islower() will return False.

According to the Python Docs, "The python unicode method islower() returns True if the unicode string's cased characters are all lowercase and the string contained at least one cased character, otherwise, it returns False."

Since the method also returns False when the string doesn't contain any cased character, e.g. "!", I want to check first whether the string contains any cased character at all.

Something like this....

string = unicode("!@#$%^", 'utf-8')

# check first if it contains cased characters
if not contains_cased(string):
    return True

return string.islower()

Any suggestions for a contains_cased() function?

Or probably a different implementation approach?

Thanks!

The answer that you have accepted appears to be incorrect. See my answer. — John Machin, Aug 18 '10 at 22:09

score 8 · Answer 1 · answered Aug 18 '10 at 02:25

import unicodedata as ud

def contains_cased(u):
  return any(ud.category(c)[0] == 'L' for c in u)

Alex Martelli

Arg alex, is there something you don't know ? – Bite code Aug 18 '10 at 07:19
+1 : working solution (compared to the nice explanation without executable code of John Machin) – oDDsKooL Jul 03 '12 at 08:45

score 8 · Accepted Answer · answered Aug 18 '10 at 08:08

Here is the full scoop on Unicode character categories.

Letter categories include:

Ll -- lowercase
Lu -- uppercase
Lt -- titlecase
Lm -- modifier
Lo -- other

Note that Ll <-> islower(); similarly for Lu; (Lu or Lt) <-> istitle()
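As a quick illustration of those five letter categories, here is a minimal sketch using `unicodedata.category`. It assumes Python 3, where `str` is already Unicode (the code elsewhere in this thread is Python 2), and the sample characters are illustrative choices rather than anything from the answer.

```python
import unicodedata as ud

# One sample character per letter category (samples chosen for illustration).
samples = {
    "Ll": "a",        # LATIN SMALL LETTER A
    "Lu": "A",        # LATIN CAPITAL LETTER A
    "Lt": "\u01c5",   # LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
    "Lm": "\u02b0",   # MODIFIER LETTER SMALL H
    "Lo": "\u4e2d",   # a CJK ideograph
}

# Look up the actual general category of each sample.
categories = {expected: ud.category(ch) for expected, ch in samples.items()}
for expected, actual in categories.items():
    print(expected, actual)
```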

You may wish to read the complicated discussion on casing, which includes some discussion of Lm letters.

Blindly treating all "letters" as cased is demonstrably wrong. The Lo category includes 45301 codepoints in the BMP (counted using Python 2.6). A large chunk of these would be Hangul Syllables, CJK Ideographs, and other East Asian characters -- very hard to understand how they might be considered "cased".
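To make that concrete, the sketch below (Python 3 assumed) applies a letter-category test like the one in the first answer to a CJK string. The helper name `contains_letter` is hypothetical. The string counts as containing letters, yet neither `upper()` nor `lower()` changes it.

```python
import unicodedata as ud

def contains_letter(u):
    # True if any character is in a letter category (Lu, Ll, Lt, Lm, Lo).
    return any(ud.category(c).startswith("L") for c in u)

text = "\u4e2d\u6587"  # two CJK ideographs, both category Lo

print(contains_letter(text))   # True: they are letters
print(text.upper() == text)    # True: no uppercase form
print(text.lower() == text)    # True: no lowercase form
print(text.islower())          # False: no cased characters at all
```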

You might like to consider an alternative definition, based on the (unspecified) behaviour of "cased characters" that you expect. Here's a simple first attempt:

>>> cased = lambda c: c.upper() != c or c.lower() != c
>>> sum(cased(unichr(i)) for i in xrange(65536))
1970
>>>

Interestingly there are 1216 x Ll and 937 x Lu, a total of 2153 ... scope for further investigation of what Ll and Lu really mean.
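For readers on Python 3, where `unichr` and `xrange` no longer exist, a rough port of the experiment above might look like this. No counts are shown because they depend on the Unicode version bundled with the interpreter and will differ from the Python 2.6 figures quoted here. The name `contains_cased` follows the question; `is_cased` is a hypothetical helper.

```python
import unicodedata as ud

def is_cased(c):
    # "Cased" here means: has an uppercase or lowercase partner.
    return c.upper() != c or c.lower() != c

def contains_cased(s):
    # True if at least one character changes under upper() or lower().
    return any(is_cased(c) for c in s)

# Compare the two definitions over the BMP (code points 0..0xFFFF).
bmp = [chr(i) for i in range(0x10000)]
by_mapping = sum(is_cased(c) for c in bmp)
by_category = sum(ud.category(c) in ("Ll", "Lu") for c in bmp)
print(by_mapping, by_category)
```

With this definition, `contains_cased(u"!@#$%^")` is false and `contains_cased(u"Hello!")` is true, which matches the behaviour the question asks for.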

John Machin

@John: Wow. Thanks for your explanation. It took me a while to understand it. I took a look at your link, and I think I have to study it more extensively. I have a feeling that what I'm going to find out is going to make me overhaul a lot of my code. Yikes. Thanks! – Albert Aug 19 '10 at 05:40
@Albert: Don't panic. As I've hinted, firstly develop a definition of what you mean by "cased". What different treatment will you apply to cased chars as opposed to uncased chars? My example definition was "char which has an uppercase or lowercase 'partner'". Some (maybe all) of the difference between the 1970 chars and the 2153 appears to be due to chars which are classified as `Ll` because they look like a lowercase character, but don't have a `Lu` partner, and vice versa -- you need to decide whether these are "cased" for your purposes. BTW you can change your accepted answer :-) – John Machin Aug 19 '10 at 06:07
@John: Well, I'm actually making an API for my web service. My web service accepts a key that maps to a specific record in my database. The key is case-sensitive, and the key can be composed of any unicode character. So in order to normalize all input, I will convert all key queries into lowercase (if they have uppercase equivalents). A consequence of that is when I create the record keys (which my users can customize), I cannot accept any uppercase character that can be converted to a lowercase equivalent by the toLower() function. So I'm trying to make a filter for that. Any suggestions? – Albert Aug 20 '10 at 12:54
@Albert: If your keys are case sensitive, why are you normalising them??? "record keys which users can customize" means what??? "any unicode char" vs "cannot accept any uppercase char" ??? To answer your question literally: Looks like you can't accept a character c when `c.lower() != c` which means that you can't accept any key if `key.lower() != key`. I think that you should start a NEW QUESTION, explaining exactly what you are trying to do, with examples. BTW1: don't forget to accept an answer to this question first. BTW2: Python doesn't have a `toLower` function ... – John Machin Aug 20 '10 at 22:29
@John: My mistake. I meant lower() function. Alright, I'll start a new question. Thanks! – Albert Aug 21 '10 at 00:35
@John: I respect your expertise in unicode. I have a new question, do you think you can take a look at it, and also at the answers, if they are correct. Thanks! http://stackoverflow.com/questions/3536397/does-python-version-2-5-2-follow-unicode-standards-for-lower-and-upper-functi – Albert Aug 21 '10 at 05:04
You’ve mistaken lowercase *letters* for lowercase *code points*. These are lowercase code points but not lowercase letters: U+0345 `GC=Mn` `COMBINING GREEK YPOGEGRAMMENI`, U+2176 `GC=Nl` `SMALL ROMAN NUMERAL SEVEN`, U+24DA `GC=So` `CIRCLED LATIN SMALL LETTER K`. And these are lowercase that don’t change case when uppercased: U+00AA `GC=Ll` `FEMININE ORDINAL INDICATOR`, U+0262 `GC=Ll` `LATIN LETTER SMALL CAPITAL G`, U+02B0 `GC=Lm` `MODIFIER LETTER SMALL H`, U+2093 `GC=Lm` `LATIN SUBSCRIPT SMALL LETTER X`, U+210A `GC=Ll` `SCRIPT SMALL G`, and U+1D521 `GC=Ll` `MATHEMATICAL FRAKTUR SMALL D`. – tchrist Aug 27 '11 at 13:02
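The literal answer given in the comments above (reject a key when `key.lower() != key`) can be sketched as follows, assuming Python 3. The function name `is_acceptable_key` is hypothetical. Characters with no lowercase mapping, such as punctuation or U+00AA, pass unchanged. A non-letter code point that does have one, such as U+2166 ROMAN NUMERAL SEVEN (category `Nl`), is rejected.

```python
def is_acceptable_key(key):
    # Accept only keys that lowercasing would leave unchanged.
    return key.lower() == key

print(is_acceptable_key("Hello!"))   # False: 'H' lowercases to 'h'
print(is_acceptable_key("!"))        # True: nothing to lowercase
print(is_acceptable_key("\u00aa"))   # True: no lowercase mapping
print(is_acceptable_key("\u2166"))   # False: lowercases to U+2176
```

Because this test relies on the case mappings themselves rather than on general categories, it needs no separate "contains cased characters" step.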

score 1 · Answer 3 · answered Aug 18 '10 at 02:27

Use the unicodedata module:

unicodedata.category(character)

returns "Ll" for lowercase letters and "Lu" for uppercase ones.

Here you can find a list of Unicode character categories.
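Applied to the original question, a category-based check might look like the sketch below (Python 3 assumed; `has_uppercase_letter` is a hypothetical name, and counting `Lt` as uppercase is a choice made for this sketch). As a comment on the other answer points out, letter categories do not cover every cased code point, so a cased non-letter such as U+24C0 CIRCLED LATIN CAPITAL LETTER K slips through.

```python
import unicodedata

def has_uppercase_letter(s):
    # Flag characters whose general category is uppercase or titlecase letter.
    return any(unicodedata.category(c) in ("Lu", "Lt") for c in s)

print(has_uppercase_letter("Hello!"))  # True
print(has_uppercase_letter("hello!"))  # False
print(has_uppercase_letter("!"))       # False
print(has_uppercase_letter("\u24c0"))  # False: category is So, not Lu
```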

mykhal
