I am trying to find a way to filter emojis from UTF-8 text files. Apparently there is a JavaScript regex available (https://raw.githubusercontent.com/mathiasbynens/emoji-regex/master/index.js) which can be used to match emojis. I could not translate this regex to the C# dialect (it looks like there are some differences I don't understand). Then I tried the following simple code to match all non-word, non-space characters in my texts (so that I can go over them manually, select the emojis, put them in a regex, and replace them with an empty string).
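    // Needs: System.Collections.Generic, System.IO, System.Text.RegularExpressions
    string input = @"some path\";
    List<char> emojis = new List<char>();

    // Scan every .txt file under the input directory, including subdirectories.
    foreach (FileInfo file in new DirectoryInfo(input).GetFiles("*.txt", SearchOption.AllDirectories))
    {
        // Match every single character that is neither a word character nor whitespace.
        MatchCollection matches = Regex.Matches(File.ReadAllText(file.FullName), @"[^\w\s]{1}");
        foreach (Match match in matches)
        {
            string value = match.Value;
            // Collect each distinct char (UTF-16 code unit) of the match.
            foreach (char c in value.ToCharArray())
            {
                if (!emojis.Contains(c))
                {
                    emojis.Add(c);
                }
            }
        }
    }

    // Write the collected chars to a file, separated by '|'.
    foreach (char c in emojis)
    {
        File.AppendAllText(@"\\Emojis.txt", c.ToString() + "|");
    }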
But I get the following exception in #develop:
System.Text.EncoderFallbackException: Unable to translate Unicode character \uD83D at index 0 to specified code page.
Apparently it is not a good idea to split the regex matches into individual chars. Any ideas how I can fix this? Regards
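For what it's worth, here is a minimal sketch of one direction I am considering. It assumes the exception comes from writing one half of a surrogate pair on its own (\uD83D is a high surrogate), so it collects whole text elements as strings instead of chars. The class and method names (`EmojiCandidates`, `Collect`) are made up for illustration, and I have not checked it against my real files.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

static class EmojiCandidates
{
    // Hypothetical helper: returns the distinct text elements of the input
    // that do not start with a word character or whitespace.
    public static SortedSet<string> Collect(string text)
    {
        var found = new SortedSet<string>(StringComparer.Ordinal);

        // A text element keeps a surrogate pair together as one string,
        // so no lone surrogate ends up in the result for well-formed input.
        TextElementEnumerator elements = StringInfo.GetTextElementEnumerator(text);
        while (elements.MoveNext())
        {
            string element = elements.GetTextElement();
            if (!Regex.IsMatch(element, @"^[\w\s]"))
            {
                found.Add(element);
            }
        }
        return found;
    }
}
```

The result could then be written in one call, for example `File.AppendAllText(outputPath, string.Join("|", EmojiCandidates.Collect(text)))`, where `outputPath` is a placeholder for the real output file. Because whole strings are written rather than single chars, the encoder should never see half of a pair.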