Python3 regex: Keep some Emojis, discard the rest

Question

Noob here. I have strings in which I want to keep some emoji and discard the rest.

INPUT:

This book is so funny❤️. This book is the bomb(AS IN THE BEST IN THE WORLD )I love it!I definitely recommend it!'

DESIRED OUTPUT:

This book is so funny❤️. This book is the bomb(AS IN THE BEST IN THE WORLD )I love it!I definitely recommend it!'

I have the re.compile that matches:

my emoji
all emoji Removing Emoticons from..... See David Mabodo answer

I don't know how to put them together in a single re.compile that excludes one from the other. Alternatively, I could keep alphanumerics, punctuation, and my emoji, and substitute everything else with "".

import re

mytext = 'This book is so funny❤️. This book  is the bomb(AS IN THE BEST IN THE WORLD   )I love    it!I definitely recommend it!'
# Desired output:
# 'This book is so funny❤️. This book is the bomb(AS IN THE BEST IN THE WORLD )I love    it!I definitely recommend it!'
print("Original text:")
print(mytext, "\n")

# Strip out emoticon modifiers, leaving a simplified emoticon to work with.
# https://en.wikipedia.org/wiki/Variation_Selectors_(Unicode_block)
# https://en.wikipedia.org/wiki/Variation_Selectors_Supplement
Emoji_Modifiers = re.compile(u'([\U0000FE00-\U0000FE0F])|([\U000E0100-\U000E01EF])')
mytext_mod_gone = Emoji_Modifiers.sub(r'', mytext)
print("Modifiers Removed:")
print(mytext_mod_gone, "\n")

# All emoticons    
find_regex      = re.compile(u'([\U00002600-\U000027BF])|([\U0001f300-\U0001f64F])|([\U0001f680-\U0001f6FF])')
# Heart emoticons
#find_regex     = re.compile(u"([\U00002619])|([\U00002661])|([\U00002665])|([\U00002763])|([\U00002764])|([\U00002765])|([\U00002766])|([\U00002767])|([\U00002E96])|([\U00002E97])|([\U00002F3C])|([\U0001F394])|([\U0001F48C])|([\U0001F48F])|([\U0001F491])|([\U0001F493])|([\U0001F494])|([\U0001F495])|([\U0001F496])|([\U0001F497])|([\U0001F498])|([\U0001F499])|([\U0001F49A])|([\U0001F49B])|([\U0001F49C])|([\U0001F49D])|([\U0001F49E])|([\U0001F49F])|([\U0001F4D6])|([\U0001F5A4])|([\U0001F60D])|([\U0001F618])|([\U0001F63B])|([\U0001F970])|([\U0001F9E1])")
# Alphanumeric + punctuation for an alternative solution
#find_regex     = re.compile(r"[^a-zA-Z0-9!,.?!#&'()*+,-./:;<=>?@\^_`{|}~\s]") # 

mytext_emoji_gone = find_regex.sub(r'', mytext)

I am falling down at:

Negating unicode with a Negative Lookbehind (?<!...). I don't understand the operands well enough, and regex101.com only works with r', not u'.
Combining multiple regexes in one re.compile. Say I wanted to keep alphanumerics and my emoji: it complains when I do re.compile(u'(\Uxxxx)' | r'(regex)') with "unsupported operand type(s) for |: 'str' and 'str'", so an OR-type statement does not work here, and an OR gives undesired results.
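
On the second point, the `|` is regex syntax, so it has to sit inside the pattern string rather than between two Python string literals. A minimal sketch of that idea follows; the names `HEARTS`, `ALNUM`, and `keep_regex` and the sample string are made up for illustration, and the two heart code points are just a small sample of the list above. It also shows that, in Python 3, the re module resolves `\uXXXX` and `\UXXXXXXXX` escapes itself, so a raw string works for code points too.

```python
import re

# Hypothetical pattern fragments, kept as plain strings.
HEARTS = "[\u2764\U0001F49C]"   # two sample heart code points
ALNUM = r"[a-zA-Z0-9\s]"        # letters, digits, whitespace

# Put the "|" inside the pattern text, then compile once.
keep_regex = re.compile("|".join([HEARTS, ALNUM]))

sample = "I \u2764 it \U0001F600!"
kept = "".join(keep_regex.findall(sample))
print(kept)

# The re module understands \u and \U escapes in str patterns,
# so the same class can be written as a raw string.
raw_regex = re.compile(r"[\u2764\U0001F49C]")
print(bool(raw_regex.fullmatch("\u2764")))
```

Here the pattern matches what should be kept; inverting it into "delete everything else" is a separate step.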

Could I have some help with either:

Ignoring a subset of emoticons and deleting the rest (my preferred solution)
Keeping (alphanumeric, punctuation, and my emoticons), and deleting the rest.
A specific question: Can you 'stack' re.compiles? I.e., create two different re.compiles to match (or not match) things, then join them together.

Actually, you are using the wrong regex to match emojis: you are matching a lot of other things, not just emojis, and you miss a lot of emojis that consist of more than 2 bytes. As you are using Python 3.x, you should discard the `u` prefix; all strings are Unicode strings by default. And to solve the issue, use a negative lookahead – Wiktor Stribiżew, Mar 13 '19 at 08:44
See https://regex101.com/r/rwTlgF/1, `'(?![\U00002619\U00002661\U00002665\U00002763\U00002764\U00002765\U00002766\U00002767\U00002E96\U00002E97\U00002F3C\U0001F394\U0001F48C\U0001F48F\U0001F491\U0001F493\U0001F494\U0001F495\U0001F496\U0001F497\U0001F498\U0001F499\U0001F49A\U0001F49B\U0001F49C\U0001F49D\U0001F49E\U0001F49F\U0001F4D6\U0001F5A4\U0001F60D\U0001F618\U0001F63B\U0001F970\U0001F9E1])[\U00002600-\U000027BF\U0001f300-\U0001f64F\U0001f680-\U0001f6FF]'` — Wiktor Stribiżew, Mar 13 '19 at 08:44
@WiktorStribiżew. Perhaps I was thinking about this wrong. I really only want my emoticons, alphanumerics, and punctuation. I could just combine a negative lookahead for my emoticons with an alphanumeric-and-punctuation class. If there are 2-byte emoticons and this matches them, I think the case will be rare. They will certainly stand out. I can paste those into Google Sheets and use the UNICODE function there to find their codes, and write some Python to capture those. Thank you for the negative lookahead example. I will test it out now – DaftVader, Mar 13 '19 at 23:07
See [Emoji v12.0](https://unicode.org/emoji/charts/full-emoji-list.html), e.g. #391 *man lifting weights*. See how many bytes it consists of. Your emoji regex will only match a fraction of it. – Wiktor Stribiżew, Mar 14 '19 at 07:44
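
To make the last comment concrete, here is a small sketch that counts the code points in the "man lifting weights" sequence and runs the question's "all emoticons" ranges over it. The variable names are made up for illustration, and the sequence is written out as escapes on the assumption that it is the usual ZWJ form (lifter, variation selector-16, zero width joiner, male sign, variation selector-16).

```python
import re

# "Man lifting weights" written as explicit code points.
man_lifting = "\U0001F3CB\uFE0F\u200D\u2642\uFE0F"
print(len(man_lifting))  # 5 code points for one visible emoji

# The "all emoticons" ranges from the question, merged into one class.
find_regex = re.compile(
    "[\U00002600-\U000027BF\U0001F300-\U0001F64F\U0001F680-\U0001F6FF]"
)

# Only the lifter and the male sign fall inside those ranges.
leftover = find_regex.sub("", man_lifting)
print([f"U+{ord(ch):04X}" for ch in leftover])  # ['U+FE0F', 'U+200D', 'U+FE0F']
```

The leftovers are invisible characters (two variation selectors and a zero width joiner), so the output can look clean while still carrying debris.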

score 1 · Answer 1 · answered Mar 13 '19 at 09:22

regex101 has a Unicode option: it is a flag you can turn on from the right side of the regex box.

I think the easiest approach is to find every emoji in the string except the ones you want to keep, and replace them with an empty string as you intended. To do that, use a regex that matches any emoji (for this example I'll use [\U00010000-\U0010ffff], but I'm sure there are better ones out there, so use one of those) and add a negative lookahead to skip the emoji you wish to keep.

The final regex should look similar to this:

(?![\u2764])[\U00010000-\U0010ffff]

The first part, (?![\u2764]), makes sure the match is not an emoji you wish to keep, and the second part, [\U00010000-\U0010ffff], makes sure it is an emoji.

You can add all the other emojis you wish to keep in the square brackets (?![\u2764 here ])
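
A minimal sketch of this approach in Python follows. It is an illustration under assumptions, not a complete emoji matcher: the candidate ranges are borrowed from the question rather than from this answer (U+2764 lies outside [\U00010000-\U0010ffff], so the lookahead only matters when the candidate class can match a kept character), and the names `KEEP`, `EMOJI`, `strip_regex`, and the sample text are made up.

```python
import re

# Code points to keep (a small sample; extend as needed).
KEEP = "\u2764\U0001F49C\U0001F4D6"
# Candidate ranges borrowed from the question; not a full emoji definition.
EMOJI = "\U00002600-\U000027BF\U0001F300-\U0001F64F\U0001F680-\U0001F6FF"

# Match one candidate character unless it is in the keep list.
strip_regex = re.compile(f"(?![{KEEP}])[{EMOJI}]")

text = "funny\u2764 bomb\U0001F4A3 read\U0001F4D6 lol\U0001F602"
cleaned = strip_regex.sub("", text)
print(cleaned)
```

In this sample the bomb and the laughing face are removed, while the heart and the book survive.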

answered Mar 13 '19 at 09:22

Gilad Shnoor


The "Unicode option" has nothing to do with the `u` string literal prefix OP is using. No need to pay attention to it, the `u` is redundant in OP code. If OP uses the correct emoji regex, the negative lookahead may fail in case the negated emoji is a starting part of a longer allowed emoji. – Wiktor Stribiżew Mar 13 '19 at 13:41
@WiktorStribiżew The part about the "Unicode option" was just a response to the OP saying that regex101 doesn't have one. I agree that it's not going to help the OP with his question, but I think it's good to know. If you think I should delete it, I'll do so. About the negative lookahead negating parts of longer emojis: are you sure? I thought the regex treats the whole emoji as one char, so it would not be a problem – Gilad Shnoor Mar 13 '19 at 14:23
@WiktorStribiżew Can you explain "fail" in regards to longer allowed emoji. Will it match my emoticons AND longer emoticons, or will it match neither which will result in them being deleted? – DaftVader Mar 13 '19 at 23:10
@GiladShnoor Ah thank you for the regex101 unicode flag on the right :) Thank you for showing the correct syntax for putting a unicode inside a negative lookahead. New things to try out today :) Thank you and Wiktor for helping a newbie. – DaftVader Mar 13 '19 at 23:14

score 1 · Answer 2 · answered Mar 13 '19 at 23:50

I went with:

find_regex     = re.compile(u"(?![\U00002619])(?![\U00002661])(?![\U00002665])(?![\U00002763])(?![\U00002764])(?![\U00002765])(?![\U00002766])(?![\U00002767])(?![\U00002E96])(?![\U00002E97])(?![\U00002F3C])(?![\U0001F394])(?![\U0001F48C])(?![\U0001F48F])(?![\U0001F491])(?![\U0001F493])(?![\U0001F494])(?![\U0001F495])(?![\U0001F496])(?![\U0001F497])(?![\U0001F498])(?![\U0001F499])(?![\U0001F49A])(?![\U0001F49B])(?![\U0001F49C])(?![\U0001F49D])(?![\U0001F49E])(?![\U0001F49F])(?![\U0001F4D6])(?![\U0001F5A4])(?![\U0001F60D])(?![\U0001F618])(?![\U0001F63B])(?![\U0001F970])(?![\U0001F9E1])"r"[^a-zA-Z0-9!,.?!#&'()*+,-./:;<=>?@\^_`{|}~\s]")

mytext_emoji_gone = find_regex.sub(r'', mytext)

which stripped out all other emoji, leaving only the heart and book emojis, and alphanumeric and punctuation.

As part of my original question, is there a way to stack those? Currently, that is one huge long line of code. Could we do something like:

regex = re.compile(a)
regex += re.compile(b)

That would use vertical real estate, but I am OK with that.
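
For what it is worth, compiled pattern objects do not support `+` or `+=`, but each one exposes its source text through the `.pattern` attribute, so the pieces can be joined as strings and compiled again. A minimal sketch, with shortened hypothetical sub-patterns (`keep_emoji`, `not_text`) standing in for the long one above:

```python
import re

# Hypothetical sub-patterns, compiled separately.
keep_emoji = re.compile("(?![\u2764\U0001F4D6])")    # emoji to leave alone
not_text = re.compile(r"[^a-zA-Z0-9!?.,'()\s]")      # anything that is not plain text

# Join the source strings, then compile the result once.
find_regex = re.compile(keep_emoji.pattern + not_text.pattern)

text = "funny\u2764 book\U0001F4D6 (best)\U0001F600!"
cleaned = find_regex.sub("", text)
print(cleaned)
```

Flags passed to the original re.compile calls are not carried along by `.pattern`, so they would need to be passed again to the final re.compile. Building the pattern from plain strings in the first place, one piece per line, avoids the intermediate compiles.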

answered Mar 13 '19 at 23:50

DaftVader


You should avoid asking separate questions in your answers. If you have another smaller issue related to your question, you can add it as a comment, an [edit to your question](https://stackoverflow.com/posts/55132461/edit), or a [new question](https://stackoverflow.com/questions/ask) with a reference to this question. – Hoppeduppeanut Mar 13 '19 at 23:56
In your case, this already has an answer here: https://stackoverflow.com/questions/33211404/python-how-do-i-do-line-continuation-with-a-long-regex – Hoppeduppeanut Mar 13 '19 at 23:58
I also found diacritics, which have to be removed prior to stripping everything else out. See this answer, worked wonders :) https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string – DaftVader Mar 14 '19 at 02:23
@Hoppeduppeanut Not quite - my question is not answered there. That's related to multiline editing, whether with strings or re.VERBOSE, whereas I wanted to know in *my original question* if 2 re.compiles could be added together. In hindsight, I should have added my regex += re.compile(b) example to my original question to make it clearer. – DaftVader Mar 15 '19 at 07:00
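
For the diacritics mentioned in the comments above, one common approach is to decompose the text and drop the combining marks before running the emoji regex. This is a sketch of that general technique, not necessarily the exact code from the linked answer; `strip_accents` is a hypothetical helper name.

```python
import unicodedata

def strip_accents(text):
    # Split accented letters into base letter + combining mark (NFD),
    # then drop characters with a non-zero combining class.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("caf\u00e9 d\u00e9j\u00e0 vu"))
```

Filtering on the combining class leaves a heart followed by U+FE0F untouched, since the variation selector has combining class 0.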
