
As part of a text classification problem, I am trying to clean a text dataset. So far I have been removing everything except letters: punctuation, numbers, and emoji were all removed. Now I want to use emoji as features, so I need to retain both words and emoji.

First, I search for emoji in the text and separate them from the surrounding words and from other emoji, because each emoji should be treated as an individual token. To do that, I find each emoji and pad it with a space on both sides.

But I am at a loss as to how to combine the known regexes for words and emoji. Here is my current code:

import re

def clean_text(raw_text):

    padded_emoji_text = pad_emojis(raw_text)
    print("Emoji padded text: " + padded_emoji_text)

    reg = re.compile("[^a-zA-Z]") # line a

    # old regex to remove everything except words  
    letters_only_text = reg.sub(' ', raw_text)
    print("Cleaned text: " + letters_only_text)

    # Code to remove everything except text and emojis
    # How?

def pad_emojis(raw_text):

    print("Original Text: " + raw_text)

    reg = re.compile(u'['
      u'\U0001F300-\U0001F64F'
      u'\U0001F680-\U0001F6FF'
      u'\u2600-\u26FF\u2700-\u27BF]', 
      re.UNICODE)

    #padding the emoji with space at both ends
    new_text = reg.sub(r' \g<0> ',raw_text) 

    return new_text

text = "I am very #happy man! but my wife is not . 99/33"
clean_text(text)

Current output:

Original Text: I am very #happy man! but my wife is not . 99/33
Emoji padded text: I am very #happy man! but     my wife   is not     . 99/33
Cleaned text: I am very  happy man  but   my wife  is not

What I am trying to achieve:

I am very happy man but     my wife   is not    

Questions:

1) How do I combine the emoji regex with the words regex in a single compiled pattern (line a)?

2) Can I achieve this in a better way, i.e. without writing a separate function just to find the emoji and pad them with spaces? I feel this can be avoided.

Ravindra S
  • See [this Python 3 demo](http://rextester.com/YKDXU24273) - I think it shows a way to do that in 1 step. Just not sure if you need to "shrink" whitespaces or not, your expected result differs a bit from what I get. – Wiktor Stribiżew May 21 '17 at 20:30
  • Hey that's great! It definitely works. I have tried many use cases and it seemed to work fine on all of them. And yes, I need to shrink the whitespace, which was the last step of the text cleaning that I didn't include in the question. Thanks for taking care of that. Now can you please add this as an answer? Also, the regex is too complex for me to understand. It would be great if you could explain it to some extent in your answer. Thanks a lot! – Ravindra S May 21 '17 at 20:39
  • Ok, just give me a second, I will also add a multiple whitespace shrinking here. – Wiktor Stribiżew May 21 '17 at 20:40

1 Answer


You may join the two steps into one using a single regex and a lambda expression inside a re.sub like this:

import re

emoji_pat = '[\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]'
shrink_whitespace_reg = re.compile(r'\s{2,}')

def clean_text(raw_text):
    reg = re.compile(r'({})|[^a-zA-Z]'.format(emoji_pat)) # line a
    result = reg.sub(lambda x: ' {} '.format(x.group(1)) if x.group(1) else ' ', raw_text)
    return shrink_whitespace_reg.sub(' ', result)

text = 'I am very #happy man! but my wife is not . 99/33'
print('Cleaned text: ' + clean_text(text))
# => Cleaned text: I am very happy man but   my wife  is not  

See the Python demo
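Since emoji characters do not always survive copy-paste, here is a minimal sketch of the same function with the test input spelled out as escape sequences. The sample string and the `strip_ends` parameter are assumptions added for illustration; they are not part of the function above.

```python
import re

# Same ranges as the pattern above, written with escapes so nothing depends on
# how emoji survive copy-paste.
EMOJI_PAT = '[\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]'
COMBINED_REG = re.compile('({})|[^a-zA-Z]'.format(EMOJI_PAT))
SHRINK_WHITESPACE_REG = re.compile(r'\s{2,}')

def clean_text(raw_text, strip_ends=False):
    # Emoji (group 1) are padded with spaces; any other non-letter becomes a space.
    padded = COMBINED_REG.sub(
        lambda m: ' {} '.format(m.group(1)) if m.group(1) else ' ', raw_text)
    result = SHRINK_WHITESPACE_REG.sub(' ', padded)
    # strip_ends is a hypothetical extra, not part of the answer's function.
    return result.strip() if strip_ends else result

sample = 'happy\U0001F600man! 42'  # U+1F600 falls inside the first range
print(repr(clean_text(sample)))
print(repr(clean_text(sample, strip_ends=True)))
```

The first call leaves one trailing space, because the trailing non-letters collapse to a single space rather than disappearing. If that matters for the downstream tokenizer, stripping the ends as in the second call is one way to handle it.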

Explanation:

  • The first regex expands to ([\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF])|[^a-zA-Z]. It either matches an emoji and captures it into Group 1, or just matches any character other than an ASCII letter. If an emoji was captured (checked with x.group(1) inside the lambda), it is returned enclosed in spaces on both sides; otherwise the non-letter is replaced with a single space.
  • The \s{2,} pattern matches two or more consecutive whitespace characters, and shrink_whitespace_reg.sub(' ', result) replaces each such run with a single space.
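
To make the role of Group 1 more concrete, the following sketch lists every match of the combined pattern together with a flag telling whether the emoji branch matched. The helper name `describe_matches` and the sample string are hypothetical, chosen only for this illustration.

```python
import re

EMOJI_PAT = '[\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]'
COMBINED_REG = re.compile('({})|[^a-zA-Z]'.format(EMOJI_PAT))

def describe_matches(text):
    """Return (matched_char, is_emoji) for every match of the combined pattern."""
    # group(1) is set only when the first alternative (the emoji class) matched;
    # for any other non-letter it is None.
    return [(m.group(0), m.group(1) is not None)
            for m in COMBINED_REG.finditer(text)]

# 'o' and 'k' are ASCII letters, so neither alternative matches them.
for char, is_emoji in describe_matches('ok\U0001F600!'):
    print(repr(char), 'emoji' if is_emoji else 'other non-letter')
```

Only two characters are reported here: the emoji, with Group 1 set, and the exclamation mark, with Group 1 empty. That distinction is what the lambda in clean_text relies on when it chooses between padding and plain replacement.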
Wiktor Stribiżew