I can see two problems with your search:
- you are trying to search the textual representation of the UTF-8 string (the \xXX represents a byte in hexadecimal). What you should actually be doing is matching against its content (the actual bytes).
- you are including the "end-of-string" marker ($) in your search, whereas you're probably interested in its occurrence anywhere in the string.
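To make both points concrete, here is a minimal sketch. The sample input is hypothetical, since I don't know what your actual data looks like:

```python
import re

# Hypothetical sample: an emoji in the middle of the input, not at the end.
data = "a \U0001F600 b".encode("utf8")

# A byte and its textual escape are different things:
one_byte = b"\xf0"      # a single byte with value 0xF0
escape_text = r"\xf0"   # four characters: backslash, x, f, 0

# Anchored with $: only matches when the sequence ends the input.
anchored = re.search(b"\xf0\x9f\x98.$", data)

# Unanchored: matches the sequence wherever it occurs.
unanchored = re.search(b"\xf0\x9f\x98.", data)

print(len(one_byte), len(escape_text), anchored, unanchored)
```

With this input the anchored search finds nothing, while the unanchored one finds the emoji bytes.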
Something like the following should work, though it is brittle (see below for a more robust solution):
re.search(b'\xf0\x9f\x98.', text_utf8)
This will give you the first occurrence of a 4-byte UTF-8 sequence prefixed by \xf0\x9f\x98.
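Assuming you're dealing only with UTF-8, this should, to the best of my knowledge, produce unambiguous matches (i.e., you don't have to worry about this prefix appearing in the middle of a longer sequence), because in valid UTF-8 the byte \xf0 can only start a sequence and never continue one.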
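A slightly stricter variant, as a sketch: instead of accepting any byte in the fourth position, accept only a UTF-8 continuation byte (0x80 to 0xBF). The sample input is made up for illustration:

```python
import re

# Hypothetical sample input: UTF-8 bytes with an emoji in the middle.
text_utf8 = "hello \U0001F600 world".encode("utf8")

# The 3-byte prefix followed by exactly one continuation byte (0x80-0xBF).
pattern = re.compile(b"\xf0\x9f\x98[\x80-\xbf]")

match = pattern.search(text_utf8)
if match:
    # Decode only the matched bytes to get the character back.
    emoji = match.group().decode("utf8")
    print(match.start(), repr(emoji))
```

Note that match.start() here is a byte offset into text_utf8, not a character index.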
A more robust solution, if third-party modules are an option, is to install the regex module and use the following:
regex.search(r'\p{Emoji=Yes}', text_utf8.decode('utf8'))
This has the advantage of being more readable and explicit, and it is probably also more future-proof. (See here for more Unicode properties that might help in your use case.)
Note that in this case you can also deal with text_utf8 as an actual unicode (str in py3) string, without converting it to a byte-string, which might have other advantages, depending on the rest of your code.
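If installing a third-party module is not an option, here is a rough standard-library sketch of working on the decoded str instead of bytes. It relies on the fact that the byte prefix \xf0\x9f\x98 plus one continuation byte encodes exactly the code points U+1F600 to U+1F63F, so it covers the same characters as the byte pattern and is much narrower than a full emoji property. The sample input is hypothetical:

```python
import re

# Hypothetical sample input, decoded once to a str.
text_utf8 = "so happy \U0001F604!".encode("utf8")
text = text_utf8.decode("utf8")

# U+1F600..U+1F63F is the code point range behind the byte prefix
# \xf0\x9f\x98 followed by a continuation byte.
pattern = re.compile("[\U0001F600-\U0001F63F]")

match = pattern.search(text)
if match:
    # match.start() is now a character index, not a byte offset.
    print(match.start(), hex(ord(match.group())))
```

Working on str means indices and slices line up with characters, which tends to be easier to reason about in the rest of the code.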