Checking if string contains unicode using standard Python

Question

I have some strings of roughly 100 characters and I need to detect if each string contains a Unicode character. The final purpose is to check if some particular emojis are present, but initially I just want a filter that catches all emojis (as well as potentially other special characters). This method should be fast.

I've seen Python regex matching Unicode properties, but I cannot use any custom packages. I'm using Python 2.7. Thanks!

**All** characters are Unicode characters. The simple test would be `if string:`; just test for non-empty strings. Any character Python can put in a string is part of the Unicode standard. — Martijn Pieters, Sep 19 '16 at 18:25
Perhaps you meant to test for *non-ASCII codepoints* or something similar? — Martijn Pieters, Sep 19 '16 at 18:26
Are you just checking for emojis? Technically, all the ASCII characters are also present in Unicode, so you need to be a little more specific when you say you're "checking for unicode characters". — Brendan Abel, Sep 19 '16 at 18:27
I would highly recommend reading this primer on unicode -- http://www.joelonsoftware.com/articles/Unicode.html — Brendan Abel, Sep 19 '16 at 18:27
@sln: not quite. This post looks like a dupe of [Is there a specific range of unicode code points which can be checked for emojis?](http://stackoverflow.com/q/38730560) at this point. — Martijn Pieters, Sep 19 '16 at 18:28
Also, is this for Python 2 or 3? If for a Python release before 3.3 (so including Python 2), do you have access to a wide build (if `sys.maxunicode` is equal to `0x10FFFF` you have a wide build; a narrow build reports `0xFFFF`)? This matters because in a *narrow* build Unicode codepoints over U+FFFF take up two code units each and are harder to test for. And that's where a lot of the Emoji codepoints are. — Martijn Pieters, Sep 19 '16 at 18:32
Woaw.. I'll have to read up on this. Thanks for the material! — pir, Sep 19 '16 at 18:40
I'm converting that table to a character class, will post it. — , Sep 19 '16 at 18:44

score 1 · Accepted Answer · edited May 23 '17 at 12:32

There is no point in testing 'if a string contains Unicode characters', because all characters in a string are Unicode characters. The Unicode standard encompasses all codepoints that Python supports, including the ASCII range (Unicode codepoints U+0000 through to U+007F).

If you want to test for Emoji codepoints, test for specific ranges, as outlined by the Unicode Emoji class specification:

import re

re.compile(
    u'[\u231A-\u231B\u2328\u23CF\u23E9-\u23F3...\U0001F9C0]',
    flags=re.UNICODE)

where you'll have to pick and choose what codepoints you consider to be Emoji. I personally would not include U+0023 NUMBER SIGN in that category for example, but apparently the Unicode standard does.

Note: To be explicit, the above expression is not complete. There are 209 separate entries in the Emoji category and I didn't feel like writing them all out.
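As a rough illustration of how such a character class could be used, here is a minimal sketch. It keeps only the few codepoints written out in the expression above and adds the Emoticons block (U+1F600 through U+1F64F) as an illustrative assumption; it is deliberately not the full Emoji list. The names `EMOJI_SUBSET` and `contains_emoji_subset` are hypothetical, and the sketch assumes Python 3.3+ or a wide Python 2 build.

```python
import re

# Deliberately partial subset of Emoji codepoints, for illustration only.
# The Emoticons block (U+1F600-U+1F64F) is an added assumption, not part
# of the original expression.
# Assumes Python 3.3+ or a wide (UCS-4) Python 2 build.
EMOJI_SUBSET = re.compile(
    u'[\u231A-\u231B\u2328\u23CF\u23E9-\u23F3'
    u'\U0001F600-\U0001F64F\U0001F9C0]',
    flags=re.UNICODE)


def contains_emoji_subset(text):
    """Return True if text has at least one codepoint from the subset."""
    return EMOJI_SUBSET.search(text) is not None
```

Compiling the pattern once and calling `search` per string keeps the per-string cost low, which should be adequate for strings of around 100 characters.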

Another note: the above uses a \Uhhhhhhhh wide Unicode escape sequence; its use is only supported in a regex pattern in Python 3.3 and up, or in a wide (UCS-4) build for earlier versions of Python. For a narrow Python build, you'll have to match on surrogate pairs for codepoints over U+FFFF.
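The wide versus narrow distinction can be sketched as follows. This is only an illustration for a single range (the Emoticons block, U+1F600 through U+1F64F, chosen here as an assumption); the names `IS_WIDE`, `EMOTICONS`, and `has_emoticon` are hypothetical. On a narrow build those codepoints are stored as surrogate pairs whose high surrogate is U+D83D and whose low surrogate runs from U+DE00 to U+DE4F.

```python
import re
import sys

# sys.maxunicode is 0x10FFFF on Python 3.3+ and on wide Python 2 builds,
# and 0xFFFF on narrow Python 2 builds.
IS_WIDE = sys.maxunicode > 0xFFFF

if IS_WIDE:
    # One code unit per codepoint: a plain range works.
    EMOTICONS = re.compile(u'[\U0001F600-\U0001F64F]')
else:
    # Narrow build: match the surrogate pair explicitly.
    # High surrogate U+D83D, low surrogates U+DE00..U+DE4F.
    EMOTICONS = re.compile(u'\ud83d[\ude00-\ude4f]')


def has_emoticon(text):
    """Return True if text contains a codepoint in U+1F600..U+1F64F."""
    return EMOTICONS.search(text) is not None
```

Other ranges above U+FFFF would each need their own surrogate-pair pattern on a narrow build, which is why the full list is tedious to write by hand.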

answered Sep 19 '16 at 18:37

Martijn Pieters


Thanks, that makes sense. Now if I only want to do a rough version that catches all emojis, but also other strings, would `return '\' in mystr` on some encoding of the string that would reveal all those backslashes then work? – pir Sep 19 '16 at 18:42
I'm in Python 2.7 – pir Sep 19 '16 at 18:44
@pir: there are no backslashes in strings; you can't test for escape sequences because escape sequences are just a way to make it easier to specify a specific codepoint. – Martijn Pieters Sep 19 '16 at 18:44
Isn't there some way in Python to reveal these escape sequences? – pir Sep 19 '16 at 18:46
No, because *all characters* can be expressed with either an escape sequence or the literal value, given the right source code encoding. – Martijn Pieters Sep 19 '16 at 18:48
The strings `u'å'` and `u'\u00e5'` and `u'\xe5'` are all equivalent. – Martijn Pieters Sep 19 '16 at 18:50
@pir: if you mean *non-ASCII* codepoints, see [Replace non-ASCII characters with a single space](https://stackoverflow.com/q/20078816) – Martijn Pieters Sep 19 '16 at 18:51
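If a rough filter for *non-ASCII* codepoints is what is wanted, as the last comment suggests, a minimal sketch could look like the following. It assumes the input is a `unicode` object (a `str` in Python 3) rather than an encoded byte string; `NON_ASCII` and `has_non_ascii` are hypothetical names. It catches all emojis, but also accented letters and any other codepoint above U+007F.

```python
import re

# Matches any single codepoint outside the ASCII range U+0000..U+007F.
NON_ASCII = re.compile(u'[^\x00-\x7F]')


def has_non_ascii(text):
    """Return True if text contains at least one non-ASCII codepoint."""
    return NON_ASCII.search(text) is not None
```

Because escape sequences are only a notation in source code, `has_non_ascii(u'\xe5')` and `has_non_ascii(u'\u00e5')` test exactly the same one-character string.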
