
I've got a text input from a mobile device. It contains emoji. In C#, I have the text as

Text  text

Simply put, I want the output text to be

Text text

I'm trying to remove all such emoji from the text with a regex, except I'm not sure how to convert that emoji into its Unicode sequence. How do I do that?

edit:

I'm trying to save the user input into MySQL. It looks like MySQL's utf8 character set doesn't support all Unicode characters, and the right way to fix that would be to change the schema, but I don't think that is an option for me. So I'm trying to just remove all the emoji characters before saving the text in the database.

This is my schema for the relevant column:

[Screenshot: schema of the Comments column]

I'm using Nhibernate as my ORM and the insert query generated looks like this:

Insert into `Content` (ContentTypeId, Comments, DateCreated) 
values (?p0, ?p1, ?p2);
?p0 = 4 [Type: Int32 (0)]. ?p1 = 'Text  text' [Type: String (20)], ?p2 = 19/01/2015 10:38:23 [Type: DateTime (0)]

When I copy this query from logs and run it on mysql directly, I get this error:

1 warning(s): 1366 Incorrect string value: '\xF0\x9F\x98\x80 t...' for column 'Comments' at row 1   0.000 sec

Also, I've tried to convert it into encoding bytes and it doesn't really work..


LocustHorde
  • UTF-8 really *should* be fine here. Can you post the details of how you're currently trying to save the data, along with your schema information? – Jon Skeet Jan 19 '15 at 11:41
  • See here: https://gist.github.com/adamlwatson/9623703 – Octopoid Jan 19 '15 at 11:41
  • (Assuming you actually want to remove them, rather than sort your encoding) – Octopoid Jan 19 '15 at 11:42
  • @JonSkeet added the info. – LocustHorde Jan 19 '15 at 11:58
  • @LocustHorde Which version of MySQL are you running on? Seemingly the character set utf8mb4 should make everything tikitiboo... have a read of the answer here http://stackoverflow.com/questions/24253985/mysql-utf-8-and-emoji-characters "It seems that MySQL supports two forms of unicode ucs2 which is 16-bits per character and utf8 up to 3 bytes per character. The bad news is that neither form is going to support plane 1 characters which require at 17 bits. (mainly emoji). It looks like MySQL 5.5.3 and up also support utf8mb4, utf16, and utf32 and supplementary characters (read emoji)" – Paul Zahra Jan 19 '15 at 12:00
  • You haven't actually shown the code you're using. The error message doesn't seem to fit with the UTF-8 encoding for either of those values, which is odd... – Jon Skeet Jan 19 '15 at 12:00
  • @JonSkeet yea, I was testing with a few emojis so the message is for another emoji. Also, not sure what you mean by code? I'm using a regular nhibernate repository that saves the object with `public virtual String Comments { get; set; }` property. The insert query produced is fine, it's just that mysql db can't handle the unicode. – LocustHorde Jan 19 '15 at 12:04
  • @PaulZahra I don't think changing the schema is an option, but I'll try to talk to the DBA about it! What I need is something like what Octopoid has mentioned, but in C#; I just can't seem to regex the emojis! – LocustHorde Jan 19 '15 at 12:08
  • Something to be aware of from http://stackoverflow.com/questions/10992921/how-to-remove-emoji-code-using-javascript "However, note that there are other characters in the Basic Multilingual Plane that are used as emoji by phones but which long predate emoji. For example U+2665 is the traditional Heart Suit character ♥, but it my be rendered as an emoji graphic on some devices. It's up to you whether you treat this as emoji and try to remove it." – Paul Zahra Jan 19 '15 at 12:32
  • Octopoid's gist doesn't convert them, it *removes* them. If you want to just remove any characters not in the BMP, that's reasonably easy. – Jon Skeet Jan 19 '15 at 12:46
  • @JonSkeet yup - I do want to just remove them! but to remove them I must regex match them and that's where I'm stuck now. – LocustHorde Jan 19 '15 at 13:23
  • "So convert to corresponding \uxxxx characters" is just a red herring? – Jon Skeet Jan 19 '15 at 13:30

1 Answer


Assuming you just want to remove all non-BMP characters, i.e. anything with a Unicode code point of U+10000 and higher, you can use a regex to remove any UTF-16 surrogate code units from the string. For example:

using System;
using System.Text.RegularExpressions;

class Test
{
    static void Main(string[] args)
    {
        string text = "x\U0001F310y";
        Console.WriteLine(text.Length); // 4
        string result = Regex.Replace(text, @"\p{Cs}", "");
        Console.WriteLine(result.Length); // 2
    }
}

Here "Cs" is the Unicode category for "surrogate".

This works because .NET's Regex operates on UTF-16 code units rather than Unicode code points; otherwise you'd need a different approach.
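To make the code-unit point concrete, here is a minimal sketch showing that a non-BMP character occupies two `char` values (a high and a low surrogate) in a .NET string, along with a regex-free way of dropping them. The class name `SurrogateDemo` and the helper `StripNonBmp` are made up for illustration; only `char.IsSurrogate`, `char.IsHighSurrogate` and `char.IsLowSurrogate` are framework methods.

```csharp
using System;
using System.Linq;

static class SurrogateDemo
{
    // Hypothetical helper: drops every UTF-16 surrogate code unit,
    // which removes all characters at U+10000 and above.
    public static string StripNonBmp(string input)
    {
        return new string(input.Where(c => !char.IsSurrogate(c)).ToArray());
    }

    static void Main()
    {
        string text = "x\U0001F310y";
        Console.WriteLine(text.Length);                   // 4
        Console.WriteLine(char.IsHighSurrogate(text[1])); // True
        Console.WriteLine(char.IsLowSurrogate(text[2]));  // True
        Console.WriteLine(StripNonBmp(text));             // xy
    }
}
```

Like the regex, this filters individual code units, so it would also drop any unpaired surrogate in malformed input.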

Note that there are non-BMP characters other than emoji, but I suspect you'll find they'll have the same problem when you try to store them.
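As a rough illustration of why that is likely, the sketch below prints the UTF-8 encoding of a few characters. It assumes the column uses MySQL's older `utf8` character set, which (per the comments under the question) stores at most 3 bytes per character; the class name `Utf8LengthDemo` and the helper `Utf8ByteCount` are hypothetical.

```csharp
using System;
using System.Text;

static class Utf8LengthDemo
{
    // Hypothetical helper: number of bytes UTF-8 needs for the given text.
    public static int Utf8ByteCount(string s)
    {
        return Encoding.UTF8.GetByteCount(s);
    }

    static void Main()
    {
        // U+1F600, an emoji: the same bytes that appear in the MySQL warning.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes("\U0001F600"))); // F0-9F-98-80

        // U+1D11E, musical symbol G clef: non-BMP but not an emoji.
        Console.WriteLine(Utf8ByteCount("\U0001D11E")); // 4

        // U+2764, a heart in the BMP.
        Console.WriteLine(Utf8ByteCount("\u2764"));     // 3
    }
}
```

Every code point at U+10000 or above needs four bytes in UTF-8, emoji or not, which is why the whole non-BMP range is affected under that assumption.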

Additionally, note that this won't remove emojis in the BMP, such as U+2764 (red heart). You can use the above as an example of how to remove characters in specific Unicode categories: the category for U+2764 is "other symbol", for example. Whether you want to remove all "other symbols" is a different matter.
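A small sketch of that trade-off, assuming you did decide to strip "other symbols" (`So`) as well as surrogates (`Cs`). The class name `CategoryDemo` and the helper `StripSurrogatesAndOtherSymbols` are invented for this example; it is an illustration of the side effect, not a recommendation.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

static class CategoryDemo
{
    // Hypothetical helper: removes surrogates plus every "other symbol" (So).
    public static string StripSurrogatesAndOtherSymbols(string input)
    {
        return Regex.Replace(input, @"[\p{So}\p{Cs}]", "");
    }

    static void Main()
    {
        // The BMP heart and the copyright sign share the same category.
        Console.WriteLine(CharUnicodeInfo.GetUnicodeCategory('\u2764')); // OtherSymbol
        Console.WriteLine(CharUnicodeInfo.GetUnicodeCategory('\u00A9')); // OtherSymbol

        // Heart, copyright sign and the non-BMP character are all removed.
        Console.WriteLine(StripSurrogatesAndOtherSymbols("a\u2764b\u00A9c\U0001F310d")); // abcd
    }
}
```

The copyright sign disappears along with the heart, so a category-based filter catches more than just emoji.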

But if you're really only interested in removing surrogate pairs because they can't be stored properly, the above should be fine.

Jon Skeet
  • Hi, I made the question to describe what I thought was my problem.. but I tried out your answer and it turns out I don't actually need to convert them.. So I have edited the question now! http://i.imgur.com/NoQfxud.png Thank you! – LocustHorde Jan 19 '15 at 14:48
  • @LocustHorde: So long as you're aware that you're just throwing away bits of the user's input... – Jon Skeet Jan 19 '15 at 14:54
  • Yea! this is a temporary solution (hopefully short term!) – LocustHorde Jan 19 '15 at 15:04
  • Hi @JonSkeet, I'm trying to use your Regex to detect if emojis are included in a string (pretty much the exact same code). For some reason `\p{Cs}` does not catch all emojis. Do you know anything about this by any chance? I've tried about 30 of them and one or two were not detected. I'm assuming they're not in the range of that regex, but i'd like your expert opinion since I know nothing about surrogates and very little about chars in general – Gil Sand Oct 24 '17 at 07:43
  • @GilSand: Well, did you look at what Unicode categories those characters are in? It's probably best to ask a new question with a complete example, rather than "one or two of them" (leaving us guessing which). We can then look at what's going on much more easily. – Jon Skeet Oct 24 '17 at 07:49
  • @JonSkeet You're right. Here's a link to the new question for you or future travelers : https://stackoverflow.com/questions/46905176/detecting-all-emojis – Gil Sand Oct 24 '17 at 08:02
  • This won't remove all emojis because some emojis such as ❤ are in the BMP. – Clement Apr 28 '23 at 01:20
  • @Clement: Thanks for pointing that out; I've added some more text at the end. – Jon Skeet Apr 28 '23 at 06:17
  • Regex.Replace(str, @"[\p{So}\p{Cs}]", string.Empty) seems to remove additional emojis that are in the BMP – Clement May 17 '23 at 01:33
  • @Clement: Yes, but it will also remove "other symbols" that aren't emojis... e.g. the copyright sign ©. If I were only trying to remove emoji, I wouldn't expect the copyright sign to be removed. – Jon Skeet May 17 '23 at 06:11