
Environment: Python 3.6. I have a UTF-8 encoded text like this:

text_utf8 = b"\xf0\x9f\x98\x80\xef\xbc\x81\xef\xbc\x81\xef\xbc\x81"

I want to match only the elements where three digits or letters follow b'\xf0\x9f\x98\'; this prefix indicates the facial expression emojis.

I tried this:

if re.search(b'\xf0\x9f\x98\[a-zA-Z0-9]{3}$', text_utf8)

but it doesn't work, and when I print the pattern it shows up as b'\xf0\x9f\x98\\[a-zA-Z1-9]{3}', with an extra \ inserted automatically. Is there any way around this? Thanks.

user9191983

1 Answer


I can see two problems with your search:

  1. You are trying to match the textual representation of the UTF-8 string (each \xXX stands for a single byte, written in hexadecimal). What you should actually match against is its content (the bytes themselves).
  2. You are including the end-of-string marker ($) in your pattern, whereas you are probably interested in occurrences anywhere in the string.
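To make the first point concrete, here is a minimal sketch (the variable names are mine, not from the question). It shows that each \xXX escape is one byte, and that the stray backslash in the original pattern turns the intended character class into a literal "[":

```python
import re

text_utf8 = b"\xf0\x9f\x98\x80\xef\xbc\x81\xef\xbc\x81\xef\xbc\x81"

# Each \xXX escape in a bytes literal is ONE byte, not four characters.
first_emoji = text_utf8[:4]
byte_values = list(first_emoji)  # four integers, one per byte

# The question's pattern, with the stray backslash written out explicitly.
# Inside the regex, \[ means a literal "[", so no character class is formed
# and the pattern looks for the literal text "[a-zA-Z0-9]]]" after the prefix.
question_pattern = b"\xf0\x9f\x98\\[a-zA-Z0-9]{3}$"
no_match = re.search(question_pattern, text_utf8)

print(byte_values, no_match)
```

This is also why printing the pattern shows a doubled backslash: that is simply how Python displays a single backslash byte.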

Something like the following should work, though it is brittle (see below for a more robust solution):

re.search(b'\xf0\x9f\x98.', text_utf8)

This will give you the first occurrence of a 4-byte UTF-8 sequence prefixed by \xf0\x9f\x98.

Assuming you're dealing only with valid UTF-8, this should, to the best of my knowledge, give unambiguous matches: the byte \xf0 can only start a 4-byte sequence, so you don't have to worry about this prefix appearing in the middle of a longer sequence.
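As a hedged sketch of the byte-level approach, the snippet below uses a slightly stricter variant of the pattern above. Replacing "." with an explicit continuation-byte range is my own choice, not something the question asked for. Note that this prefix covers the code points U+1F600 to U+1F63F only.

```python
import re

text_utf8 = b"\xf0\x9f\x98\x80\xef\xbc\x81\xef\xbc\x81\xef\xbc\x81"

# A UTF-8 continuation byte is always in the range 0x80-0xBF, so spell that
# out instead of using "." (which would also accept malformed input).
face_pattern = re.compile(b"\xf0\x9f\x98[\x80-\xbf]")

match = face_pattern.search(text_utf8)
if match:
    # Decode only the matched bytes to get the emoji as a str.
    emoji = match.group().decode("utf-8")
    print(match.span(), emoji)
```

Use face_pattern.findall(text_utf8) instead of search if you need every occurrence rather than just the first.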


A more robust solution, if third-party modules are an option, would be to install the regex module and use the following:

regex.search(r'\p{Emoji=Yes}', text_utf8.decode('utf8'))

This has the advantage of being more readable and explicit, and it is probably more future-proof as well. Be aware that the Unicode Emoji property also covers the ASCII digits, # and *, so check that it is not too broad for your data. Other Unicode properties might also help in your use case.

Note that in this case you are working with text_utf8 as an actual Unicode string (str in Python 3) rather than as a byte string, which might have other advantages, depending on the rest of your code.
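If a third-party module is not an option, working on the decoded str is also possible with the standard re module. The following is only a sketch under the assumption that you want exactly the characters whose UTF-8 encoding starts with \xf0\x9f\x98, which are the code points U+1F600 to U+1F63F:

```python
import re

text_utf8 = b"\xf0\x9f\x98\x80\xef\xbc\x81\xef\xbc\x81\xef\xbc\x81"
text = text_utf8.decode("utf-8")

# U+1F600-U+1F63F are exactly the code points whose UTF-8 encoding
# starts with the bytes f0 9f 98.
face_pattern = re.compile("[\U0001F600-\U0001F63F]")

# findall returns every matching character, in order of appearance.
faces = face_pattern.findall(text)
print(faces)
```

Matching on str means the pattern is expressed in code points, so it no longer depends on how the text happens to be encoded.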

Leo Antunes