Matching emojis in C#

Question

I am trying to find a way to filter emojis out of UTF-8 text files. There is a JavaScript regex (https://raw.githubusercontent.com/mathiasbynens/emoji-regex/master/index.js) that can be used to match emojis, but I could not translate it to the C# regex dialect (there seem to be some differences I don't understand). So I tried the following simple code to match every non-word, non-space character in my texts. The plan was to go over the results manually, pick out the emojis, put them in a regex, and replace them with an empty string.

    string input = @"some path\";
    List<char> emojis = new List<char>();
    foreach(FileInfo file in new DirectoryInfo(input).GetFiles("*.txt", SearchOption.AllDirectories))
    {
        MatchCollection matches = Regex.Matches(File.ReadAllText(file.FullName), @"[^\w\s]{1}");
        foreach(Match match in matches)
        {
            string value = match.Value;
            foreach(char c in value.ToCharArray())
            {
                if(!emojis.Contains(c))
                {
                    emojis.Add(c);
                }
            }
        }
    }
    foreach(char c in emojis)
    {
        File.AppendAllText(@"\\Emojis.txt", c.ToString()+"|");
    }

But I get the following exception in SharpDevelop (#develop):

System.Text.EncoderFallbackException: Unable to translate Unicode character \uD83D at index 0 to specified code page.

Apparently it is not a good idea to split the regex matches into individual `char` values. Any ideas how I can fix this? Regards
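The exception message points at `\uD83D`, which is a high surrogate: emojis outside the Basic Multilingual Plane occupy two `char` values in a .NET string, and half of such a pair cannot be encoded as UTF-8 on its own. One possible way around it, as a minimal sketch, is to keep each match as a `string` and to try a full surrogate pair before falling back to a single character. The names `SymbolCollector`, `CollectSymbols`, and `CanEncodeAsUtf8` are hypothetical, and the strict `UTF8Encoding` stands in for whatever encoder the file write uses:

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class SymbolCollector
{
    // Either a full surrogate pair (one code point outside the BMP)
    // or a single non-word, non-space UTF-16 code unit.
    private static readonly Regex Symbol =
        new Regex(@"[\uD800-\uDBFF][\uDC00-\uDFFF]|[^\w\s]");

    // Hypothetical helper: distinct symbols, kept as strings, in first-seen order.
    public static List<string> CollectSymbols(string text)
    {
        var seen = new HashSet<string>(StringComparer.Ordinal);
        var symbols = new List<string>();
        foreach (Match match in Symbol.Matches(text))
        {
            if (seen.Add(match.Value))
            {
                symbols.Add(match.Value);
            }
        }
        return symbols;
    }

    // Hypothetical helper: true when the string survives strict UTF-8 encoding.
    public static bool CanEncodeAsUtf8(string value)
    {
        // Second argument: throw on invalid input such as a lone surrogate.
        var strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            strictUtf8.GetBytes(value);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // U+1F600 written as its surrogate pair, twice, plus '!' and U+00A9.
        string text = "hi \uD83D\uDE00! \uD83D\uDE00 \u00A9";
        List<string> symbols = CollectSymbols(text);

        Console.WriteLine(symbols.Count);
        Console.WriteLine(CanEncodeAsUtf8("\uD83D"));                  // lone half of a pair
        Console.WriteLine(CanEncodeAsUtf8(string.Join("|", symbols))); // whole pairs
    }
}
```

Joining the collected strings with `"|"` and writing them in a single `File.WriteAllText` call would then replace the per-character `File.AppendAllText` loop.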

You can use the Javascript regex pattern by replacing every `\u` with `\x`, as well as the `\*` near the beginning with `\\*` — Abion47, Dec 03 '16 at 16:16
I think there is no difference between \x and \u. I tried both ways. With \x it throws the exception '[x-y] range in reverse order'. With \u it throws 'not enough hexadecimal digits'. — Shakir, Dec 03 '16 at 16:36
This regex works in an online JavaScript regex tester, so something has to be changed to make it work with C#. Perhaps hexadecimal pairs need to be added to compensate for character values above 64000 or so? — Shakir, Dec 03 '16 at 16:44
What is throwing that exception? I am using the pattern in a test program with no issues using precisely the differences I explained in my previous comment. — Abion47, Dec 03 '16 at 16:45
I used the following code, which throws an exception in #develop: Regex regex = new Regex(@"", RegexOptions.Compiled); With or without the edits, nothing changes. — Shakir, Dec 03 '16 at 16:50
This is an online .NET regex tester. Throws parsing error. http://www.systemtextregularexpressions.com/regex.match — Shakir, Dec 03 '16 at 16:52
Are you trying to escape the `\x` characters in regex? You don't need to do that. Leave them as `\xABCD` and C# will escape them itself, which will give the regex a pattern with pre-escaped characters. — Abion47, Dec 03 '16 at 16:55
I use Expresso (http://www.ultrapico.com/expresso.htm) for regex testing in C#. Cannot parse it with changes or without changes. — Shakir, Dec 03 '16 at 16:56
I simply copy and paste the original into the code (and the tester software). I use @, so there is only a single \, as in \x. — Shakir, Dec 03 '16 at 16:58
Sharp develop. I do have VS as well. It is simply too heavy. Let me give it a try. — Shakir, Dec 03 '16 at 17:02
It is the same. I think it has something to do with 16-bit hexadecimal value pairs. Ref: http://stackoverflow.com/questions/24840667/what-is-the-regex-to-extract-all-the-emojis-from-a-string — Shakir, Dec 03 '16 at 17:18
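The disagreement in the comments above seems to hinge on which layer interprets the escape. In a regular C# string literal the compiler consumes `\x` with one to four hex digits, so the regex engine never sees the backslash. In a verbatim (`@"..."`) literal the backslash reaches the .NET regex engine, which reads `\x` with exactly two hex digits and `\u` with exactly four. The sketch below illustrates this and is probably the source of both error messages quoted earlier; `EscapeDemo` and `IsValidPattern` are hypothetical names:

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Hypothetical helper: true when the .NET regex parser accepts the pattern.
    public static bool IsValidPattern(string pattern)
    {
        try
        {
            new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            // Regex parse errors are reported as ArgumentException (or a subclass).
            return false;
        }
    }

    public static void Main()
    {
        // Regular literal: the compiler turns \x20E3 into the single char U+20E3.
        Console.WriteLine("\x20E3".Length);
        // Verbatim literal: six chars; the regex engine reads \x20 (a space), then "E3".
        Console.WriteLine(@"\x20E3".Length);
        Console.WriteLine(Regex.IsMatch(" E3", @"\x20E3"));

        // Verbatim \x inside a class: parsed as \x21 '9' '4'-\x21 '9' '9',
        // and the range '4'-'!' runs backwards.
        Console.WriteLine(IsValidPattern(@"[\x2194-\x2199]"));
        // The same text in a regular literal is a valid U+2194..U+2199 range.
        Console.WriteLine(IsValidPattern("[\x2194-\x2199]"));
        // Verbatim \u works when it has four digits, and fails with fewer.
        Console.WriteLine(IsValidPattern(@"[\u2194-\u2199]"));
        Console.WriteLine(IsValidPattern(@"\uA9"));
    }
}
```

Under these rules, a pattern pasted into a verbatim string or into a regex tester would need `\uXXXX` escapes with four digits throughout, while the `\x` form only behaves as intended in a regular string literal.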

score 2 · Accepted Answer · answered Dec 03 '16 at 17:03

This is the code I am running in Visual Studio, and it executes without a problem.

string regex = "(?:0\x20E3|1\x20E3|2\x20E3|3\x20E3|4\x20E3|5\x20E3|6\x20E3|7\x20E3|8\x20E3|9\x20E3|#\x20E3|\\*\x20E3|\xD83C(?:\xDDE6\xD83C(?:\xDDE8|\xDDE9|\xDDEA|\xDDEB|\xDDEC|\xDDEE|\xDDF1|\xDDF2|\xDDF4|\xDDF6|\xDDF7|\xDDF8|\xDDF9|\xDDFA|\xDDFC|\xDDFD|\xDDFF)|\xDDE7\xD83C(?:\xDDE6|\xDDE7|\xDDE9|\xDDEA|\xDDEB|\xDDEC|\xDDED|\xDDEE|\xDDEF|\xDDF1|\xDDF2|\xDDF3|\xDDF4|\xDDF6|\xDDF7|\xDDF8|\xDDF9|\xDDFB|\xDDFC|\xDDFE|\xDDFF)|\xDDE8\xD83C(?:\xDDE6|\xDDE8|\xDDE9|\xDDEB|\xDDEC|\xDDED|\xDDEE|\xDDF0|\xDDF1|\xDDF2|\xDDF3|\xDDF4|\xDDF5|\xDDF7|\xDDFA|\xDDFB|\xDDFC|\xDDFD|\xDDFE|\xDDFF)|\xDDE9\xD83C(?:\xDDEA|\xDDEC|\xDDEF|\xDDF0|\xDDF2|\xDDF4|\xDDFF)|\xDDEA\xD83C(?:\xDDE6|\xDDE8|\xDDEA|\xDDEC|\xDDED|\xDDF7|\xDDF8|\xDDF9|\xDDFA)|\xDDEB\xD83C(?:\xDDEE|\xDDEF|\xDDF0|\xDDF2|\xDDF4|\xDDF7)|\xDDEC\xD83C(?:\xDDE6|\xDDE7|\xDDE9|\xDDEA|\xDDEB|\xDDEC|\xDDED|\xDDEE|\xDDF1|\xDDF2|\xDDF3|\xDDF5|\xDDF6|\xDDF7|\xDDF8|\xDDF9|\xDDFA|\xDDFC|\xDDFE)|\xDDED\xD83C(?:\xDDF0|\xDDF2|\xDDF3|\xDDF7|\xDDF9|\xDDFA)|\xDDEE\xD83C(?:\xDDE8|\xDDE9|\xDDEA|\xDDF1|\xDDF2|\xDDF3|\xDDF4|\xDDF6|\xDDF7|\xDDF8|\xDDF9)|\xDDEF\xD83C(?:\xDDEA|\xDDF2|\xDDF4|\xDDF5)|\xDDF0\xD83C(?:\xDDEA|\xDDEC|\xDDED|\xDDEE|\xDDF2|\xDDF3|\xDDF5|\xDDF7|\xDDFC|\xDDFE|\xDDFF)|\xDDF1\xD83C(?:\xDDE6|\xDDE7|\xDDE8|\xDDEE|\xDDF0|\xDDF7|\xDDF8|\xDDF9|\xDDFA|\xDDFB|\xDDFE)|\xDDF2\xD83C(?:\xDDE6|\xDDE8|\xDDE9|\xDDEA|\xDDEB|\xDDEC|\xDDED|\xDDF0|\xDDF1|\xDDF2|\xDDF3|\xDDF4|\xDDF5|\xDDF6|\xDDF7|\xDDF8|\xDDF9|\xDDFA|\xDDFB|\xDDFC|\xDDFD|\xDDFE|\xDDFF)|\xDDF3\xD83C(?:\xDDE6|\xDDE8|\xDDEA|\xDDEB|\xDDEC|\xDDEE|\xDDF1|\xDDF4|\xDDF5|\xDDF7|\xDDFA|\xDDFF)|\xDDF4\xD83C\xDDF2|\xDDF5\xD83C(?:\xDDE6|\xDDEA|\xDDEB|\xDDEC|\xDDED|\xDDF0|\xDDF1|\xDDF2|\xDDF3|\xDDF7|\xDDF8|\xDDF9|\xDDFC|\xDDFE)|\xDDF6\xD83C\xDDE6|\xDDF7\xD83C(?:\xDDEA|\xDDF4|\xDDF8|\xDDFA|\xDDFC)|\xDDF8\xD83C(?:\xDDE6|\xDDE7|\xDDE8|\xDDE9|\xDDEA|\xDDEC|\xDDED|\xDDEE|\xDDEF|\xDDF0|\xDDF1|\xDDF2|\xDDF3|\xDDF4|\xDDF7|\xDDF8|\xDDF9|\xDDFB|\xDDFD|\xDDFE|\xDDFF)|\xDDF9\xD83C(?:\xDDE6|\xDDE8|\xDDE9|\xDDEB|\x
DDEC|\xDDED|\xDDEF|\xDDF0|\xDDF1|\xDDF2|\xDDF3|\xDDF4|\xDDF7|\xDDF9|\xDDFB|\xDDFC|\xDDFF)|\xDDFA\xD83C(?:\xDDE6|\xDDEC|\xDDF2|\xDDF8|\xDDFE|\xDDFF)|\xDDFB\xD83C(?:\xDDE6|\xDDE8|\xDDEA|\xDDEC|\xDDEE|\xDDF3|\xDDFA)|\xDDFC\xD83C(?:\xDDEB|\xDDF8)|\xDDFD\xD83C\xDDF0|\xDDFE\xD83C(?:\xDDEA|\xDDF9)|\xDDFF\xD83C(?:\xDDE6|\xDDF2|\xDDFC)))|[\xA9\xAE\x203C\x2049\x2122\x2139\x2194-\x2199\x21A9\x21AA\x231A\x231B\x2328\x23CF\x23E9-\x23F3\x23F8-\x23FA\x24C2\x25AA\x25AB\x25B6\x25C0\x25FB-\x25FE\x2600-\x2604\x260E\x2611\x2614\x2615\x2618\x261D\x2620\x2622\x2623\x2626\x262A\x262E\x262F\x2638-\x263A\x2648-\x2653\x2660\x2663\x2665\x2666\x2668\x267B\x267F\x2692-\x2694\x2696\x2697\x2699\x269B\x269C\x26A0\x26A1\x26AA\x26AB\x26B0\x26B1\x26BD\x26BE\x26C4\x26C5\x26C8\x26CE\x26CF\x26D1\x26D3\x26D4\x26E9\x26EA\x26F0-\x26F5\x26F7-\x26FA\x26FD\x2702\x2705\x2708-\x270D\x270F\x2712\x2714\x2716\x271D\x2721\x2728\x2733\x2734\x2744\x2747\x274C\x274E\x2753-\x2755\x2757\x2763\x2764\x2795-\x2797\x27A1\x27B0\x27BF\x2934\x2935\x2B05-\x2B07\x2B1B\x2B1C\x2B50\x2B55\x3030\x303D\x3297\x3299]|\xD83C[\xDC04\xDCCF\xDD70\xDD71\xDD7E\xDD7F\xDD8E\xDD91-\xDD9A\xDE01\xDE02\xDE1A\xDE2F\xDE32-\xDE3A\xDE50\xDE51\xDF00-\xDF21\xDF24-\xDF93\xDF96\xDF97\xDF99-\xDF9B\xDF9E-\xDFF0\xDFF3-\xDFF5\xDFF7-\xDFFF]|\xD83D[\xDC00-\xDCFD\xDCFF-\xDD3D\xDD49-\xDD4E\xDD50-\xDD67\xDD6F\xDD70\xDD73-\xDD79\xDD87\xDD8A-\xDD8D\xDD90\xDD95\xDD96\xDDA5\xDDA8\xDDB1\xDDB2\xDDBC\xDDC2-\xDDC4\xDDD1-\xDDD3\xDDDC-\xDDDE\xDDE1\xDDE3\xDDEF\xDDF3\xDDFA-\xDE4F\xDE80-\xDEC5\xDECB-\xDED0\xDEE0-\xDEE5\xDEE9\xDEEB\xDEEC\xDEF0\xDEF3]|\xD83E[\xDD10-\xDD18\xDD80-\xDD84\xDDC0]";
string input = "aedfvwefervsreA";
var result = Regex.Match(input, regex);

I don't have a test string to check whether it produces the correct results, but it should be a one-to-one representation of the JavaScript pattern in C#.
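One way to check a pattern like this is to run it against a string that contains a few known emojis and then strip the matches with `Regex.Replace`. The sketch below is only an illustration: `ShortPattern` is an abbreviated stand-in built from a few ranges that also appear in the full pattern above, not the full pattern itself, and `EmojiFilter` and `Strip` are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmojiFilter
{
    // Abbreviated stand-in, NOT the full pattern: a few BMP symbols plus the
    // surrogate pairs for U+1F600..U+1F64F (high \uD83D, low \uDE00-\uDE4F).
    // Regular literal, so the C# compiler resolves the \u escapes.
    private const string ShortPattern =
        "[\u2600-\u2604\u2764]|\uD83D[\uDE00-\uDE4F]";

    // Hypothetical helper: removes every match of the stand-in pattern.
    public static string Strip(string input)
    {
        return Regex.Replace(input, ShortPattern, string.Empty);
    }

    public static void Main()
    {
        // U+2764 (heavy black heart) and U+1F600 written as a surrogate pair.
        string input = "I \u2764 C# \uD83D\uDE00!";

        foreach (Match match in Regex.Matches(input, ShortPattern))
        {
            // Length is counted in UTF-16 code units: 1 for U+2764, 2 for U+1F600.
            Console.WriteLine(match.Index + ": length " + match.Value.Length);
        }

        Console.WriteLine(Strip(input)); // prints "I  C# !"
    }
}
```

Swapping the full pattern in for `ShortPattern` should leave the rest of the sketch unchanged.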

It works I think. ™|®|♥|©|||❤|||☺|♣|||||||||||||||||✌||||||||||||||||||||||||||||||| Thanks a lot. — Shakir, Dec 03 '16 at 17:29
