4

I have a pandas DataFrame in Python like the one below:

sentence
------------

I like it
+1
One :-) :)
hah

I need to select only the rows that contain emoticons or emojis, so the result should look like this:

sentence
------------

+1
One :-) :)

How can I do that in Python?

dingaro
  • 1
    you can select the emoji with unicode, but `:-)` is tricky – mozway Jun 02 '22 at 13:28
  • You could maybe create a table of hardcoded text emoticons that aren't actual emojis, like ":)" and ":-)", and then check whether your sentences contain any of them? – Josip Juros Jun 02 '22 at 13:31
  • Have you defined a set of emoticons you want to find? You could maybe put together a regex pattern if it's just combos of `eye_character nose_character mouth_character` – 0x263A Jun 02 '22 at 13:33
  • Does this answer your question? [How to extract all the emojis from text?](https://stackoverflow.com/questions/43146528/how-to-extract-all-the-emojis-from-text) – eshirvana Jun 02 '22 at 13:33
  • eshirvana, but how do I apply a function from your link to my DataFrame? Moreover, I need to select rows with emojis and rows with emoticons, so not only emojis :) – dingaro Jun 02 '22 at 13:34
  • @eshirvana You need to do some more reading of the comments here. – Josip Juros Jun 02 '22 at 13:35

1 Answer

6

You can select the Unicode emojis with a regex character range:

df2 = df[df['sentence'].str.contains(r'[\u263a-\U0001f645]')]

output:

  sentence
0      
2     +1
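
For reference, here is a minimal, self-contained sketch of the range-based filter. The sample data is a hypothetical stand-in (the emoji in the question's table did not survive copying), so the two emoji below are assumptions, written as escapes:

```python
import pandas as pd

# Hypothetical sample data; the two emoji are assumptions, not the asker's data.
df = pd.DataFrame({
    "sentence": [
        "\U0001F600",      # a lone emoji (GRINNING FACE, U+1F600)
        "I like it",
        "+1 \U0001F44D",   # text plus THUMBS UP SIGN (U+1F44D)
        "One :-) :)",
        "hah",
    ]
})

# Character class spanning U+263A..U+1F645, as in the line above.
emoji_range = r"[\u263a-\U0001f645]"

# Boolean mask: True where the sentence contains a character in the range.
mask = df["sentence"].str.contains(emoji_range, regex=True)
df2 = df[mask]
print(df2.index.tolist())

# The range is broad: it also covers non-emoji characters such as the
# katakana U+30C4, and it stops before emoji such as ROCKET (U+1F680).
probe = pd.Series(["\u30c4", "\U0001F680"])
probe_hits = probe.str.contains(emoji_range, regex=True).tolist()
print(probe_hits)
```

Note that this single range is only an approximation of "emoji": as the probe shows, it can match ordinary CJK text and miss emoji with higher code points, so you may want to adjust the bounds for your data.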

This is, however, much more ambiguous for ASCII emoticons, as there is no standard definition and there are probably endless combinations. If you limit it to smiley faces made of eyes (`:` or `;`), an optional single "nose" character, and a mouth (`)` or `(`), you could use:

df[df['sentence'].str.contains(r'[\u263a-\U0001f645]|(?:[:;]\S?[\)\(])')]

output:

     sentence
0         
2        +1
3  One :-) :)
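
To make the combined pattern easier to read, it can be spelled out piece by piece with `re.VERBOSE`. This is just a sketch of the same regex; the sample data and the helper name `has_emoji_or_smiley` are hypothetical:

```python
import re
import pandas as pd

# Same pattern as above, split into commented parts.
emoji_or_smiley = re.compile(
    r"""
    [\u263a-\U0001f645]   # one character in the Unicode range
    |                     # or a simple ASCII smiley:
    [:;]                  #   eyes
    \S?                   #   optional nose (any single non-whitespace character)
    [)(]                  #   mouth
    """,
    re.VERBOSE,
)

def has_emoji_or_smiley(text: str) -> bool:
    """Return True if the text contains a match anywhere (hypothetical helper)."""
    return emoji_or_smiley.search(text) is not None

# Hypothetical sample data; the emoji are assumptions written as escapes.
df = pd.DataFrame({
    "sentence": ["\U0001F600", "I like it", "+1 \U0001F44D", "One :-) :)", "hah"]
})

selected = df[df["sentence"].map(has_emoji_or_smiley)]
print(selected.index.tolist())
```

Because the nose is "any non-whitespace character", something like `x:a)` would also match, which is one source of false positives.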

But you would miss plenty of other ASCII emoticons: :O, :P, 8D, etc.
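
If you prefer an explicit list over a generic eyes/mouth pattern, one option is to build the alternation from the emoticons you care about. The list below is a hypothetical, deliberately short example, not a complete inventory:

```python
import re
import pandas as pd

# Hypothetical, deliberately short emoticon list; extend it to suit your data.
EMOTICONS = [":)", ":-)", ";)", ":(", ":O", ":P", "8D"]

# Escape each emoticon so characters such as ")" and "(" are matched literally.
emoticon_alternation = "|".join(re.escape(e) for e in EMOTICONS)

# Unicode range from the answer, or any emoticon from the list.
pattern = rf"[\u263a-\U0001f645]|(?:{emoticon_alternation})"

df = pd.DataFrame({"sentence": ["I like it", "wow :O", "nice :P", "8D", "hah"]})
mask = df["sentence"].str.contains(pattern, regex=True)
print(df[mask])
```

An explicit list is easier to audit than a broad character class, at the cost of having to maintain it.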

mozway
  • Having a list of those ASCII emojis and then checking for hits that way is also a possibility? There can't be that many ASCII emojis? – Josip Juros Jun 02 '22 at 13:35
  • 1
    @JosipJuros as an enthusiast, oh yes there can https://en.wikipedia.org/wiki/List_of_emoticons – 0x263A Jun 02 '22 at 13:36
  • @Josip search for "ascii art" and you'll be amazed how creative ascii emojis can be (ツ) – mozway Jun 02 '22 at 13:37
  • Oh damn yes there is a ton yikes.... Then finding a generic solution is better at least for ones that have a ";:" and ")(" – Josip Juros Jun 02 '22 at 13:38
  • mozway, is it possible to also select sentences with :O, :P, 8D? Do you know how to do that? – dingaro Jun 02 '22 at 13:38
  • 1
    You can add more eyes/mouth characters, but the more you add, the more you risk edge cases with false positives; for instance, `8D` could be found in a legitimate product ID, or `:P` in a sentence with a missing space after the colon ( ͡° ͜ʖ ͡°) – mozway Jun 02 '22 at 13:38
  • 1
    Pedantic point: some ASCII emoji, such as `ツ`, are not in fact ASCII ;-) – snakecharmerb Jun 02 '22 at 14:20
  • 1
    @snakecharmerb very true, this one is actually a (unicode) Japanese katakana ;) – mozway Jun 02 '22 at 14:21
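
The false positives mentioned in the comments (`8D` inside a product ID, `:P` after a colon with no space) can be reduced by requiring the emoticon to stand alone as a whitespace-delimited token. This is only one possible mitigation, sketched under that assumption; the emoticon list and sample sentences are hypothetical:

```python
import re
import pandas as pd

# Hypothetical short list of emoticons to look for.
EMOTICONS = [":)", ":-)", ":P", "8D"]
alternation = "|".join(re.escape(e) for e in EMOTICONS)

# Loose: the emoticon may appear anywhere, even inside another token.
loose = rf"(?:{alternation})"

# Strict: no non-whitespace character directly before or after the emoticon.
strict = rf"(?<!\S)(?:{alternation})(?!\S)"

df = pd.DataFrame({
    "sentence": [
        "so funny :P",
        "Product ID 8D4471",
        "Note:Please reply",
        "great 8D",
    ]
})

loose_hits = df["sentence"].str.contains(loose, regex=True)
strict_hits = df["sentence"].str.contains(strict, regex=True)
print(pd.DataFrame({"sentence": df["sentence"], "loose": loose_hits, "strict": strict_hits}))
```

The trade-off is that the strict version misses emoticons glued to a word or to punctuation, such as `nice:)` or `ok :).`, so which variant fits depends on your data.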