How can I extract only emoji from UTF-8 with regex in Python?

Question

Environment: Python 3.6. I have UTF-8 encoded text like this:

text_utf8 = b"\xf0\x9f\x98\x80\xef\xbc\x81\xef\xbc\x81\xef\xbc\x81"

I want to match only the elements in which three digits or letters follow b'\xf0\x9f\x98\'; that prefix indicates the facial-expression emojis.

I tried this

if re.search(b'\xf0\x9f\x98\[a-zA-Z0-9]{3}$', text_utf8)

but it doesn't work, and when I print the pattern it comes out as b'\xf0\x9f\x98\\[a-zA-Z1-9]{3}', with an extra \ inserted automatically. Is there a way around this? Thanks.

You escaped `[` - is that intentional? That ruins the character class. — Wiktor Stribiżew, Jun 12 '19 at 11:19
Does `if re.search(b'\xf0\x9f\x98[a-zA-Z0-9]{3}$', text_utf8)` work? — Wiktor Stribiżew, Jun 12 '19 at 11:29
Sorry, I modified my question a bit. Yes, it raises no error; it just doesn't match. — user9191983, Jun 12 '19 at 11:35
I just converted `\x80\xef\xbc\x81\xef\xbc\x81\xef\xbc\x81` and it shows `ï¼ï¼ï¼` - you can't match them with `[a-zA-Z0-9]`, these are not alphanumerics. — Wiktor Stribiżew, Jun 12 '19 at 11:39
Thanks. I just want to match this part: `\xf0\x9f\x98\x80` — user9191983, Jun 12 '19 at 11:47

Leo Antunes · Answer 1 · 2019-06-12T12:53:22.287

I can see two problems with your search:

You are matching against the textual representation of the UTF-8 data, where each \xXX stands for a single byte written in hexadecimal. What you should match against is the content itself, the actual bytes: the byte after \xf0\x9f\x98 is \x80, not the three characters x, 8 and 0.
You include the end-of-string anchor ($) in your pattern, whereas you are probably interested in occurrences anywhere in the string.

Something like the following should work, though brittle (see below for a more robust solution):

re.search(b'\xf0\x9f\x98.', text_utf8)

This gives you the first occurrence of a 4-byte UTF-8 sequence prefixed by \xf0\x9f\x98.
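
If you want every occurrence rather than just the first, and the result as text rather than raw bytes, something along these lines might do. It is only a sketch: FACE_PREFIX and all_faces are names I made up, and I swapped the "." for the explicit continuation-byte range [\x80-\xbf], which is slightly stricter.

```python
import re

text_utf8 = b"\xf0\x9f\x98\x80\xef\xbc\x81\xef\xbc\x81\xef\xbc\x81"

# [\x80-\xbf] is the range of UTF-8 continuation bytes; it stands in
# for "." and cannot accidentally match an ASCII byte.
FACE_PREFIX = re.compile(b"\xf0\x9f\x98[\x80-\xbf]")

first = FACE_PREFIX.search(text_utf8)
if first:
    print(first.group())                         # the raw 4 bytes
    print(ascii(first.group().decode("utf-8")))  # the code point, escaped

# findall returns every match instead of only the first one.
all_faces = [m.decode("utf-8") for m in FACE_PREFIX.findall(text_utf8)]
```

Keep in mind that this prefix only covers U+1F600 through U+1F63F. The rest of the Unicode Emoticons block (U+1F640 through U+1F64F) starts with \xf0\x9f\x99 instead, which is part of why the byte-level approach is brittle.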

Assuming you're dealing only with UTF-8, this should, to the best of my knowledge, produce unambiguous matches (i.e., you don't have to worry about this prefix appearing in the middle of a longer sequence).
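
The reason is that UTF-8 lead bytes and continuation bytes never overlap. A small check of that claim, on a hypothetical sample string of my own that mixes 1-, 2-, 3- and 4-byte characters:

```python
import re

PREFIX = b"\xf0\x9f\x98"

def char_start_offsets(text):
    """Return the byte offsets where each character starts in UTF-8."""
    offsets = set()
    pos = 0
    for ch in text:
        offsets.add(pos)
        pos += len(ch.encode("utf-8"))
    return offsets

# Hypothetical sample: ASCII, a 2-byte, two 3-byte and three 4-byte characters.
sample = "a\U0001F600\u00e9\uff01\U0001F603\u6f22\U0001F63Az"
encoded = sample.encode("utf-8")

starts = char_start_offsets(sample)
match_offsets = [m.start() for m in re.finditer(re.escape(PREFIX), encoded)]
all_on_boundaries = all(offset in starts for offset in match_offsets)

# A continuation byte always has the bit pattern 10xxxxxx.
# 0xF0 is 11110000, so it can only ever be a lead byte.
lead_is_continuation = (PREFIX[0] & 0xC0) == 0x80
```

Every match offset lands on a character boundary, and the bit test shows why: \xf0 cannot occur as a continuation byte.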

A more robust solution, if you have the option of third-party modules, would be installing the regex module and using the following:

import regex  # third-party module: pip install regex
regex.search(r'\p{Emoji=Yes}', text_utf8.decode('utf8'))

This has the advantage of being more readable and explicit, and it is probably more future-proof too. (There are more Unicode properties that might help in your use case.)

Note that in this case you can also deal with text_utf8 as an actual unicode (str in py3) string, without converting it to a byte-string, which might have other advantages, depending on the rest of your code.
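
If a third-party module is not an option, a similar str-based approach can be sketched with the standard re module and an explicit code point range. This is a different technique from the Emoji property above, and it rests on an assumption of mine: that "facial expression emojis" means the Unicode Emoticons block, U+1F600 through U+1F64F. One upside is that it cannot match ASCII digits, # or *, all of which carry the Unicode Emoji property. The names EMOTICONS and extract_emoticons are hypothetical.

```python
import re

text_utf8 = b"\xf0\x9f\x98\x80\xef\xbc\x81\xef\xbc\x81\xef\xbc\x81"
text = text_utf8.decode("utf-8")

# Assumption: only the Emoticons block (U+1F600..U+1F64F) is wanted.
# Other emoji live outside this range and will not match.
EMOTICONS = re.compile("[\U0001F600-\U0001F64F]")

def extract_emoticons(s):
    """Return every character of s that falls in the Emoticons block."""
    return EMOTICONS.findall(s)

faces = extract_emoticons(text)
mixed = extract_emoticons("room 101 \U0001F600 ok \U0001F642 #1")
```

In the second call the digits and the # are skipped, and only the two face characters come back.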

Thanks! It worked for the case given in the question, but how can I avoid numbers? It actually gives priority to numbers and doesn't pick up only emojis... — user9191983, Jun 13 '19 at 04:44
