How do I parse a Unicode 'string' for characters greater than \uFFFF?
I tried re and regex, but neither appears to properly match Unicode characters that take more than two bytes (four hex digits).
Take any Unicode string (for example, tweet text, which is encoded in UTF-8):
emotes = regex.findall('[\u263A\u263B\u062A\u32E1]',tweet_json_obj['text'])
if emotes: print "Happy:{0}".format(len(emotes))
The output is the number of smiley faces contained in the text, and it works great.
But if I try to match the Emoticons block of Unicode characters (http://www.fileformat.info/info/unicode/block/emoticons/index.htm):
emotes = regex.findall('[\u01F600-\u01F64F]',tweet_json_obj['text'])
if emotes: print "Emoticon:{0}".format(len(emotes))
The output is a count of every character in the string, minus whitespace. Why is regex matching every character in the tweet, or at least what looks like string.printable?
I expect a count of 0 for most of the dataset, since I don't expect people to insert these emoticons, but they might, so I'd like to check for their existence. Is my regex incorrect?
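One possible explanation for the over-matching, sketched here as an assumption rather than a confirmed diagnosis: the \u escape takes exactly four hex digits, so '\u01F600' may be read as U+01F6 followed by the literal characters "00". The character class would then contain the range from '0' to U+01F6, which covers ASCII digits and letters. The sketch below uses Python 3 and the standard re module (not the Python 2 setup above), and the sample text is hypothetical.

```python
import re

# Hypothetical sample text; it stands in for tweet_json_obj['text'].
text = "Hello world 42 \U0001F600"

# \u consumes exactly four hex digits, so "\u01F600" is U+01F6 followed by "00".
# The class becomes: U+01F6, '0', the range '0'..U+01F6, '4', 'F'.
# That range includes ASCII digits and letters, but not spaces (U+0020).
broken = re.findall("[\u01F600-\u01F64F]", text)

# \U consumes eight hex digits and can express code points above U+FFFF.
fixed = re.findall("[\U0001F600-\U0001F64F]", text)

print("broken pattern matches:", len(broken))
print("fixed pattern matches:", len(fixed))
```

On a narrow Python 2 build, characters above U+FFFF are stored as surrogate pairs, so the eight-digit form may still not behave as a single-character range there.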