As part of a text classification problem, I am trying to clean a text dataset. Until now I removed everything except letters: punctuation, numbers, and emoji were all stripped. Now I want to use emoji as features, so I need to retain emoji as well as words.
First I search for emoji in the text and separate them from adjacent words and from other emoji, because each emoji should be treated as an individual token. To do this, I find each emoji and pad it with a space on both sides.
But I am at a loss as to how to combine the known regexes for words and emoji. Here is my current code:
import re


def clean_text(raw_text):
    padded_emoji_text = pad_emojis(raw_text)
    print("Emoji padded text: " + padded_emoji_text)

    reg = re.compile("[^a-zA-Z]")  # line a
    # old regex to remove everything except words
    letters_only_text = reg.sub(' ', raw_text)
    print("Cleaned text: " + letters_only_text)

    # Code to remove everything except text and emojis
    # How?


def pad_emojis(raw_text):
    print("Original Text: " + raw_text)
    reg = re.compile(u'['
                     u'\U0001F300-\U0001F64F'
                     u'\U0001F680-\U0001F6FF'
                     u'\u2600-\u26FF\u2700-\u27BF]',
                     re.UNICODE)
    # padding the emoji with space at both ends
    new_text = reg.sub(r' \g<0> ', raw_text)
    return new_text


text = "I am very #happy man! but my wife is not . 99/33"
clean_text(text)
Current output:
Original Text: I am very #happy man! but my wife is not . 99/33
Emoji padded text: I am very #happy man! but my wife is not . 99/33
Cleaned text: I am very happy man but my wife is not
What I am trying to achieve:
I am very happy man but my wife is not
Questions:
1) How do I add the emoji regex to the compiled words regex (line a)?
2) Can I achieve this in a better way, i.e. without writing a separate function just to find the emoji and pad them with spaces? I suspect this step can be avoided.
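To make both questions concrete, here is a minimal sketch of the kind of combined approach I have in mind. It assumes Python 3 and reuses the same emoji ranges as pad_emojis (which are not an exhaustive list of emoji). The names EMOJI_RANGES, TOKEN_RE and clean_text_keep_emoji are hypothetical and only for illustration. Instead of substituting unwanted characters, it extracts the wanted tokens (runs of letters, or single emoji) and joins them with spaces, so no separate padding step would be needed:

```python
import re

# Same ranges as in pad_emojis; an assumption, not a complete emoji list.
EMOJI_RANGES = (
    '\U0001F300-\U0001F64F'
    '\U0001F680-\U0001F6FF'
    '\u2600-\u26FF\u2700-\u27BF'
)

# A token is either a run of ASCII letters or exactly one emoji,
# so consecutive emoji come out as separate tokens.
TOKEN_RE = re.compile('[a-zA-Z]+|[' + EMOJI_RANGES + ']')


def clean_text_keep_emoji(raw_text):
    # Keep only words and emoji, separated by single spaces.
    return ' '.join(TOKEN_RE.findall(raw_text))
```

If substitution is preferred over extraction, the same ranges could presumably be placed inside the negated class from line a, i.e. '[^a-zA-Z' + EMOJI_RANGES + ']', but the padding step would then still be needed to split adjacent emoji.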