I have a really long string containing emojis, and I want to "normalize" it: make it all lowercase and replace every accented letter with its plain alphabetic analogue. My code is below:
#include <cctype>
#include <string>
#include <unordered_map>

std::string normalise(std::string to_normalise)
{
    std::unordered_map<char, char> char_map = {
        {'_', ' '}, {'á', 'a'}, {'à', 'a'},
        {'â', 'a'}, {'ä', 'a'}, {'é', 'e'},
        {'è', 'e'}, {'ê', 'e'}, {'ë', 'e'},
        {'í', 'i'}, {'ì', 'i'}, {'î', 'i'},
        {'ï', 'i'}, {'ó', 'o'}, {'ò', 'o'},
        // etc.
    };
    std::string result;
    // Loop over the input string
    for (char c : to_normalise)
    {
        // If the character is in the map, replace it
        if (char_map.count(c))
        {
            result += std::tolower(char_map[c]);
            continue;
        }
        else if (std::isalpha(c))
        {
            // If the character is alphabetic, add it to the output string in lowercase
            result += std::tolower(c);
            continue;
        }
        else if (std::ispunct(c) || std::isemoji(c))
        {
            // Surround punctuation and emojis with spaces so they become separate words
            result += ' ';
            result += c;
            result += ' ';
            continue;
        }
        else if (std::isspace(c))
        {
            result += c;
            continue;
        }
    }
    return result;
}
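To make the intended behaviour concrete, here is a hypothetical call (the emoji is only an example):

normalise("Héllo_World!🙂"); // what I want, roughly: "hello world ! 🙂"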
The idea is to split punctuation and emojis out as separate words. Pretty simple task, so I made this function:
namespace std
{
    bool isemoji(char c)
    {
        int codepoint = static_cast<unsigned char>(c);
        return codepoint >= 0x1F600 && codepoint <= 0x1F64F;
    }
}
However, it gives me something like 130 as the codepoint for a smiley face (), which is totally not in the range of Unicode emojis. Some time later I came up with this, which I thought was a) much simpler and b) should work correctly:
namespace std
{
    bool isemoji(char c)
    {
        return (c >= 0x1F600 && c <= 0x1F64F);
    }
}
Nevertheless, it failed. With a little testing (roughly reconstructed below), I found out that an emoji takes 4 bytes (a multibyte char?), so actually I have no idea what to do, except maybe use some per-byte operations. But I don't know how.
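For instance, a check along these lines shows the four bytes. I'm assuming here that the smiley above (which got stripped) was U+1F642; its last UTF-8 byte is 0x82, i.e. 130 in decimal, which would explain the value I saw:

#include <cstdio>
#include <string>

int main()
{
    std::string s = "🙂"; // U+1F642, assuming the source file is saved as UTF-8
    std::printf("s.size() = %zu\n", s.size()); // prints 4: one emoji, four chars
    for (unsigned char byte : s)
        std::printf("%02X ", static_cast<unsigned>(byte)); // prints F0 9F 99 82
    std::printf("\n");
}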
Some explanation of how a single character can take up a different number of chars (bytes) depending on which character it is would be great too.
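For the per-byte idea, this is the direction I have in mind, though I don't know whether it's right: read the lead byte to get the sequence length, assemble the code point, and only then test the emoji range. Just a sketch, assuming the input is valid UTF-8; utf8_sequence_length, decode_utf8 and is_emoji_codepoint are names I made up:

#include <cstddef>
#include <string>

// How many bytes the UTF-8 sequence starting with this byte has
// (0 signals a continuation byte, i.e. not a valid lead byte).
std::size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80)         return 1; // 0xxxxxxx: plain ASCII
    if ((lead >> 5) == 0x6)  return 2; // 110xxxxx
    if ((lead >> 4) == 0xE)  return 3; // 1110xxxx
    if ((lead >> 3) == 0x1E) return 4; // 11110xxx
    return 0;
}

// Decode the code point whose `len`-byte sequence starts at s[i] (no validation).
char32_t decode_utf8(const std::string& s, std::size_t i, std::size_t len)
{
    static const unsigned char lead_mask[] = {0, 0x7F, 0x1F, 0x0F, 0x07};
    char32_t cp = static_cast<unsigned char>(s[i]) & lead_mask[len];
    for (std::size_t k = 1; k < len; ++k)
        cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
    return cp;
}

bool is_emoji_codepoint(char32_t cp)
{
    return cp >= 0x1F600 && cp <= 0x1F64F; // the Emoticons block
}

The loop in normalise would then have to advance by utf8_sequence_length(...) chars at a time instead of one.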