I have some strings of roughly 100 characters and I need to detect whether each string contains a Unicode character. The end goal is to check whether some particular emojis are present, but initially I just want a filter that catches all emojis (as well as potentially other special characters). This method should be fast.

I've seen Python regex matching Unicode properties, but I cannot use any custom packages. I'm using Python 2.7. Thanks!

pir
  • 1
    **All** characters are Unicode characters. The simple test would be `if string:`; just test for non-empty strings. Any character Python can put in a string is part of the Unicode standard. – Martijn Pieters Sep 19 '16 at 18:25
  • Perhaps you meant to test for *non-ASCII codepoints* or something similar? – Martijn Pieters Sep 19 '16 at 18:26
  • Are you just checking for emojis? Technically, all the ASCII characters are also present in Unicode, so you need to be a little more specific when you say you're "checking for unicode characters". – Brendan Abel Sep 19 '16 at 18:27
  • 1
    I would highly recommend reading this primer on unicode -- http://www.joelonsoftware.com/articles/Unicode.html – Brendan Abel Sep 19 '16 at 18:27
  • Codepoints `[\u0100-\U0001ffff]` –  Sep 19 '16 at 18:27
  • 3
    @sln: not quite. This post looks like a dupe of [Is there a specific range of unicode code points which can be checked for emojis?](http://stackoverflow.com/q/38730560) at this point. – Martijn Pieters Sep 19 '16 at 18:28
  • Also, is this for Python 2 or 3? If for a Python release before 3.3 (so including Python 2), do you have access to a wide build (if `sys.maxunicode` is equal to `0x10FFFF` you have a wide build)? This matters because in a *narrow* build Unicode codepoints over U+FFFF take up two code units each and are harder to test for. And that's where a lot of the Emoji codepoints are. – Martijn Pieters Sep 19 '16 at 18:32
  • `Total elements: 910` good luck with that one. –  Sep 19 '16 at 18:33
  • Woaw.. I'll have to read up on this. Thanks for the material! – pir Sep 19 '16 at 18:40
  • I'm converting that table to a character class, will post it. –  Sep 19 '16 at 18:44

1 Answer

There is no point in testing 'if a string contains Unicode characters', because all characters in a string are Unicode characters. The Unicode standard encompasses all codepoints that Python supports, including the ASCII range (Unicode codepoints U+0000 through to U+007F).
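If the goal is really the rough filter suggested in the comments (flag anything outside the ASCII range), a minimal sketch could look like the following. The function name `has_non_ascii` is made up for illustration, and the code assumes the input is a `unicode` object on Python 2 (or a `str` on Python 3), not an encoded byte string.

```python
def has_non_ascii(text):
    """Return True if any codepoint in text lies above U+007F."""
    # ord() gives the codepoint (or code unit on a narrow build);
    # everything in the ASCII range is 0x7F or below.
    return any(ord(ch) > 0x7F for ch in text)


print(has_non_ascii(u'plain text'))   # ASCII only
print(has_non_ascii(u'caf\xe9'))      # contains U+00E9
```

This catches emoji, but also accented letters, typographic quotes, and any other non-ASCII character, so it is only a coarse first pass.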

If you want to test for Emoji codepoints, test for specific ranges, as outlined by the Unicode Emoji class specification:

import re

re.compile(
    u'[\u231A-\u231B\u2328\u23CF\u23E9-\u23F3...\U0001F9C0]',
    flags=re.UNICODE)

where you'll have to pick and choose what codepoints you consider to be Emoji. I personally would not include U+0023 NUMBER SIGN in that category for example, but apparently the Unicode standard does.

Note: To be explicit, the above expression is not complete. There are 209 separate entries in the Emoji category and I didn't feel like writing them all out.
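As a concrete but still deliberately partial illustration, here is a sketch that covers only a handful of ranges. The selection of ranges and the names `PARTIAL_EMOJI_RE` and `contains_emoji` are assumptions made for this example, not the full Emoji list; the two larger BMP and astral blocks also contain symbols that are not Emoji. It assumes Python 3.3+ or a wide Python 2 build.

```python
import re

# Partial, hand-picked selection; NOT the complete Emoji list.
PARTIAL_EMOJI_RE = re.compile(
    u'['
    u'\u231A-\u231B'          # WATCH, HOURGLASS
    u'\u2600-\u26FF'          # Miscellaneous Symbols block
    u'\U0001F300-\U0001F5FF'  # Miscellaneous Symbols and Pictographs block
    u'\U0001F600-\U0001F64F'  # Emoticons block
    u']',
    flags=re.UNICODE)


def contains_emoji(text):
    """Return True if text has a codepoint in one of the ranges above."""
    return PARTIAL_EMOJI_RE.search(text) is not None


print(contains_emoji(u'no emoji here'))
print(contains_emoji(u'grinning face: \U0001F600'))
```

Compiling the pattern once at module level keeps the per-string cost to a single `search()` call, which matters when filtering many short strings.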

Another note: the above uses a \Uhhhhhhhh wide Unicode escape sequence; its use is only supported in a regex pattern in Python 3.3 and up, or in a wide (UCS-4) build for earlier versions of Python. For a narrow Python build, you'll have to match on surrogate pairs for codepoints over U+FFFF.
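A sketch of how the narrow/wide distinction might be handled, using `sys.maxunicode` to pick a pattern. It covers only U+1F300 through U+1F6FF as an example range; the names `PICTOGRAPH_RE` and `has_pictograph` are hypothetical. On a narrow build those codepoints are stored as surrogate pairs, with high surrogate U+D83C or U+D83D.

```python
import re
import sys

if sys.maxunicode > 0xFFFF:
    # Wide Python 2 build, or Python 3.3+: one codepoint per character.
    PICTOGRAPH_RE = re.compile(u'[\U0001F300-\U0001F6FF]')
else:
    # Narrow build: match the surrogate pairs explicitly.
    # U+1F300..U+1F3FF -> D83C DF00..DFFF
    # U+1F400..U+1F6FF -> D83D DC00..DEFF
    PICTOGRAPH_RE = re.compile(
        u'\ud83c[\udf00-\udfff]|\ud83d[\udc00-\udeff]')


def has_pictograph(text):
    """Return True if text has a codepoint in U+1F300..U+1F6FF."""
    return PICTOGRAPH_RE.search(text) is not None


print(has_pictograph(u'rocket: \U0001F680'))
print(has_pictograph(u'plain'))
```

The same idea extends to other astral ranges, but each range has to be split by high surrogate by hand, which is why a wide build or Python 3.3+ is much more convenient.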

Martijn Pieters
  • Thanks, that makes sense. Now if I only want to do a rough version that catches all emojis, but also other strings, would `return '\' in mystr` on some encoding of the string that would reveal all those backslashes then work? – pir Sep 19 '16 at 18:42
  • I'm in Python 2.7 – pir Sep 19 '16 at 18:44
  • @pir: there are no backslashes in strings; you can't test for escape sequences because escape sequences are just a way to make it easier to specify a specific codepoint. – Martijn Pieters Sep 19 '16 at 18:44
  • Isn't there some way in Python to reveal these escape sequences? – pir Sep 19 '16 at 18:46
  • No, because *all characters* can be expressed with either an escape sequence or the literal value, given the right source code encoding. – Martijn Pieters Sep 19 '16 at 18:48
  • The strings `u'å'` and `u'\u00e5'` and `u'\xe5'` are all equivalent. – Martijn Pieters Sep 19 '16 at 18:50
  • @pir: if you mean *non-ASCII* codepoints, see [Replace non-ASCII characters with a single space](https://stackoverflow.com/q/20078816) – Martijn Pieters Sep 19 '16 at 18:51
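
The point made in these comments, that an escape sequence is only a way of writing a codepoint and leaves no backslash in the string, can be checked directly. A small sketch (the variable names are arbitrary):

```python
a = u'\u00e5'   # four-digit Unicode escape
b = u'\xe5'     # two-digit hex escape for the same codepoint

print(a == b)        # True: both are the single character U+00E5
print(len(a))        # 1
print(u'\\' in a)    # False: the string holds no backslash
```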