Noob here. I have strings in which I want to keep some emoji and discard the rest.
INPUT:
This book is so funny❤️. This book is the bomb(AS IN THE BEST IN THE WORLD )I love it!I definitely recommend it!'
DESIRED OUTPUT:
This book is so funny❤️. This book is the bomb(AS IN THE BEST IN THE WORLD )I love it!I definitely recommend it!'
I have re.compile patterns that match:
- my emoji
- all emoji (from "Removing Emoticons from....."; see David Mabodo's answer)
I don't know how to combine them into a single re.compile that excludes one set from the other. Alternatively, I could keep alphanumerics, punctuation, and my emoji, and replace everything else with "".
import re

mytext = u'This book is so funny❤️. This book is the bomb(AS IN THE BEST IN THE WORLD )I love it!I definitely recommend it!'
# Desired output:
# u'This book is so funny❤️. This book is the bomb(AS IN THE BEST IN THE WORLD )I love it!I definitely recommend it!'
print ("Original text:")
print (mytext, "\n")
# Strip out emoticon modifiers, leaving a simplified emoticon to work with.
# https://en.wikipedia.org/wiki/Variation_Selectors_(Unicode_block)
# https://en.wikipedia.org/wiki/Variation_Selectors_Supplement
# Variation Selectors: U+FE00-U+FE0F; Variation Selectors Supplement: U+E0100-U+E01EF
Emoji_Modifiers = re.compile(u'([\U0000FE00-\U0000FE0F])|([\U000E0100-\U000E01EF])')
mytext_mod_gone = Emoji_Modifiers.sub(r'', mytext)
print ("Modifiers Removed:")
print (mytext_mod_gone, "\n")
# All emoticons
find_regex = re.compile(u'([\U00002600-\U000027BF])|([\U0001f300-\U0001f64F])|([\U0001f680-\U0001f6FF])')
# Heart emoticons
#find_regex = re.compile(u"([\U00002619])|([\U00002661])|([\U00002665])|([\U00002763])|([\U00002764])|([\U00002765])|([\U00002766])|([\U00002767])|([\U00002E96])|([\U00002E97])|([\U00002F3C])|([\U0001F394])|([\U0001F48C])|([\U0001F48F])|([\U0001F491])|([\U0001F493])|([\U0001F494])|([\U0001F495])|([\U0001F496])|([\U0001F497])|([\U0001F498])|([\U0001F499])|([\U0001F49A])|([\U0001F49B])|([\U0001F49C])|([\U0001F49D])|([\U0001F49E])|([\U0001F49F])|([\U0001F4D6])|([\U0001F5A4])|([\U0001F60D])|([\U0001F618])|([\U0001F63B])|([\U0001F970])|([\U0001F9E1])")
# Alphanumeric + punctuation for an alternative solution
#find_regex = re.compile(r"[^a-zA-Z0-9!,.?!#&'()*+,-./:;<=>?@\^_`{|}~\s]") #
mytext_emoji_gone = find_regex.sub(r'', mytext)
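One way the preferred "keep a subset, delete the rest" behaviour might be expressed is a negative lookahead placed in front of the broad character class. This is only a sketch: the names KEEP, ALL_EMOJI, and remove_others are hypothetical, the keep-list is a small made-up subset of the heart list above, and it assumes Python 3 and that the variation selectors have already been stripped.

```python
import re

# Hypothetical keep-list: three heart code points picked from the heart list above
KEEP = u'\u2764\u2665\U0001F49C'
# The same three ranges used by find_regex above
ALL_EMOJI = u'\u2600-\u27BF\U0001F300-\U0001F64F\U0001F680-\U0001F6FF'

# (?![...]) is a negative lookahead: the match is rejected when the next
# character is in KEEP, so only the other emoji are matched and removed.
remove_others = re.compile(u'(?![' + KEEP + u'])[' + ALL_EMOJI + u']')

sample = u'funny\u2764 bomb\U0001F4A3 best\U0001F600'
cleaned = remove_others.sub(u'', sample)
print(cleaned)
```

A lookahead is used here rather than a lookbehind because the test has to apply to the character about to be consumed, not to the one before it.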
I am falling down at:
- Negating unicode with a negative lookbehind, (?<!...). I don't understand the operands well enough, and regex101.com only works with r'...' patterns, not u'...'.
- Combining multiple regexes in one re.compile. Say I wanted to keep alphanumerics and my emoji: re.compile(u'(\Uxxxx)' | r'(regex)') complains with "unsupported operand type(s) for |: 'str' and 'str'", so an OR-type statement does not work here...and an OR gives undesired results.
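The TypeError above arises because | is being applied to two Python strings; regex alternation is part of the pattern text itself. A minimal sketch, assuming Python 3 (where u'...' and r'...' literals are both plain str and can be concatenated); the names heart, word_chars, and keep_either are made up for illustration:

```python
import re

heart = u'\u2764'               # ordinary literal: Python expands the escape
word_chars = r'[A-Za-z0-9\s]'   # raw literal: the backslash is left for the regex engine

# Join the pattern strings first, with '|' inside the pattern, then compile once
keep_either = re.compile(heart + u'|' + word_chars)

sample = u'I \u2764 it\U0001F600!'
kept = u''.join(keep_either.findall(sample))
print(kept)
```

Since Python 3.3 the re module also understands \uXXXX and \UXXXXXXXX escapes inside raw patterns, so a pattern tested as r'...' can usually be pasted back without converting it to u'...'.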
Could I have some help with either:
- Ignoring a subset of emoticons and deleting the rest (my preferred solution)
- Keeping (alphanumeric, punctuation, and my emoticons), and deleting the rest.
- A specific question: can you 'stack' re.compiles? I.e., create 2 different re.compiles to match (or not match) things, then join them together.
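On the 'stacking' question: compiled patterns cannot be OR-ed together directly, but each one exposes its source text through the .pattern attribute, so a combined pattern can be rebuilt from the pieces. The sketch below assumes Python 3; my_emoji, plain_text, and keep are hypothetical names, and the two-heart keep-list is only an example. Flags passed to the original re.compile calls are not carried over by .pattern and would need to be passed again.

```python
import re
import string

# Hypothetical "keep" emoji (two hearts from the list above)
my_emoji = re.compile(u'[\u2764\u2665]')
# ASCII letters, digits, whitespace, and punctuation
plain_text = re.compile(r'[A-Za-z0-9\s' + re.escape(string.punctuation) + r']')

# Wrap each source pattern in a non-capturing group and join them with '|'
keep = re.compile(u'(?:%s)|(?:%s)' % (my_emoji.pattern, plain_text.pattern))

sample = u'funny\u2764. bomb\U0001F4A3 (BEST)\U0001F600 I love it!'
# Keep only the characters that match; everything else is dropped
kept = u''.join(m.group(0) for m in keep.finditer(sample))
print(kept)
```

This also covers the second option (keep alphanumerics, punctuation, and chosen emoji, delete the rest), at the cost of dropping any non-ASCII letters that are not listed explicitly.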