
I have a dataframe with a column "clear_message", and I created a column that counts all the words in each row.

history['word_count'] = history.clear_message.apply(lambda x: Counter(x.split(' ')))

For example, if a row's message is: Hello my name is Hello, then the counter for that row will be Counter({'Hello': 2, 'is': 1, 'my': 1, 'name': 1})

The problem

My text contains emoji, and I also want each emoji to be counted.

For example:

test = 'here sasdsa'
test_counter = Counter(test.split(' '))

The output is:

Counter({'sasdsa': 1, 'here': 1})

But I want:

Counter({'sasdsa': 1, '': 5, 'here':1})

Clearly the problem is that I'm using split(' ').

What I thought about:

Adding a space before and after each emoji, like:

test = '     here sasdsa'

And then use the split, which will work.

  1. I'm not sure this approach is the best.
  2. I'm not sure how to implement it. (I do know that, with the emoji package, if i is an emoji then i in emoji.UNICODE_EMOJI returns True.)
sheldonzy

2 Answers


I think your idea of adding spaces around each emoji is a good approach. You'll also need to strip whitespace in case there was already a space between an emoji and the next character, but that's simple enough. Something like:

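import emoji  # third-party package: pip install emoji

def emoji_splitter(text):
    new_string = ""
    for char in text:
        # Surround every emoji with spaces so it becomes its own token.
        # Note: newer releases of the emoji package changed this lookup;
        # emoji.is_emoji(char) may be needed instead of UNICODE_EMOJI.
        if char in emoji.UNICODE_EMOJI:
            new_string += " {} ".format(char)
        else:
            new_string += char
    # Split on spaces and drop the empty strings left by repeated spaces.
    return [v for v in map(lambda x: x.strip(), new_string.split(" ")) if v != ""]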

Maybe you could improve this by using a sliding window to check for spaces after emojis and only add spaces where necessary, but that would assume there will only ever be one space, whereas this solution should account for 0 to n spaces between emojis.
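
To tie the splitter back to the Counter from the question, here is a minimal sketch. So that it runs without the emoji package, it swaps the emoji.UNICODE_EMOJI lookup for a hypothetical is_emoji helper built on a couple of code-point ranges; that helper and the EMOJI_RANGES table are assumptions for illustration and do not cover every emoji.

```python
from collections import Counter

# Assumption: a rough, incomplete stand-in for the emoji package's lookup.
EMOJI_RANGES = (
    (0x1F300, 0x1FAFF),  # pictographs, emoticons, transport, supplemental symbols
    (0x2600, 0x27BF),    # miscellaneous symbols and dingbats
)


def is_emoji(char):
    """Return True if char falls inside one of the assumed emoji ranges."""
    code = ord(char)
    return any(low <= code <= high for low, high in EMOJI_RANGES)


def emoji_splitter(text):
    """Split text on whitespace, treating every emoji as its own token."""
    spaced = "".join(" {} ".format(c) if is_emoji(c) else c for c in text)
    # str.split() with no argument collapses runs of whitespace,
    # so no empty strings are left to filter out.
    return spaced.split()


def count_tokens(text):
    """Count words and individual emoji in text."""
    return Counter(emoji_splitter(text))


grin = "\U0001F600"  # written as an escape so the source stays ASCII
counts = count_tokens(grin * 5 + "here sasdsa")
print(counts[grin], counts["here"], counts["sasdsa"])  # 5 1 1
```

With a dataframe, the same function could presumably be passed to apply in place of the lambda from the question. Because the loop works one code point at a time, emoji made of several code points (for example, ones with skin-tone modifiers) would be split into pieces by this sketch.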

ConorSheehan1
    There were a few problems with this code, like counting `''` and no space before the first emoji. I added a few things and posted it here as an answer. Tell me what you think. Thanks :) – sheldonzy Nov 19 '17 at 18:41

There were some problems with @con--'s answer, so I fixed them.

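import emoji  # third-party package: pip install emoji

def emoji_splitter(text):
    new_string = ""
    text = text.lstrip()
    if text:
        # Always put a space after the first character.
        new_string += text[0] + " "
    # Collapse runs of whitespace in the rest, then add a space after each emoji.
    for char in ' '.join(text[1:].split()):
        new_string += char
        if char in emoji.UNICODE_EMOJI:
            new_string += " "
    return list(map(lambda x: x.strip(), new_string.split()))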

Example:

emoji_splitter(' a ads')
Out[7]: ['a', '', '', '', 'ads']
sheldonzy
    Good catch! I did miss the case about not having a space before the first emoji. However, I don't think your solution fully solves it either. For example, if you try ```emoji_splitter('aa ads')``` your code would return ```['a', 'a', '', '', 'ads']``` since you insert a space after the first character rather than before the first emoji. I've edited my answer and I think I've accounted for all cases now, by surrounding all emojis in spaces, splitting on spaces and then removing any empty strings left in the list. – ConorSheehan1 Nov 19 '17 at 20:00
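
The difference this comment describes can be checked with a small sketch that runs both strategies on an input shaped like the one above. The function names split_after_first_char and split_around_emoji are made up for this comparison, and is_emoji is a hypothetical stand-in for the `char in emoji.UNICODE_EMOJI` test that covers only one block of emoji code points.

```python
def is_emoji(char):
    # Assumption: rough stand-in for `char in emoji.UNICODE_EMOJI`;
    # covers only one common block of emoji code points.
    return 0x1F300 <= ord(char) <= 0x1FAFF


def split_after_first_char(text):
    """Second answer's logic: space after the first character and after each emoji."""
    new_string = ""
    text = text.lstrip()
    if text:
        new_string += text[0] + " "
    for char in " ".join(text[1:].split()):
        new_string += char
        if is_emoji(char):
            new_string += " "
    return new_string.split()


def split_around_emoji(text):
    """Edited first answer's logic: surround each emoji with spaces, drop empties."""
    spaced = "".join(" {} ".format(c) if is_emoji(c) else c for c in text)
    return [part for part in spaced.split(" ") if part != ""]


grin = "\U0001F600"  # written as an escape so the source stays ASCII
sample = "aa" + grin * 3 + " ads"

# The first emoji stays glued to the preceding letter, and "aa" is split in two.
print(split_after_first_char(sample) == ["a", "a" + grin, grin, grin, "ads"])  # True
# Every emoji becomes its own token and "aa" stays whole.
print(split_around_emoji(sample) == ["aa", grin, grin, grin, "ads"])  # True
```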