
I have a really long string containing emojis, and I want to "normalize" it: make it all lowercase and replace accented letters with their plain ASCII analogues. My code is below:

std::string normalise(std::string to_normalise)
{
    std::unordered_map<char, char> char_map = {
        {'_', ' '}, {'á', 'a'}, {'à', 'a'},
        {'â', 'a'}, {'ä', 'a'}, {'é', 'e'},
        {'è', 'e'}, {'ê', 'e'}, {'ë', 'e'},
        {'í', 'i'}, {'ì', 'i'}, {'î', 'i'},
        {'ï', 'i'}, {'ó', 'o'}, {'ò', 'o'},
        //etc
    };

    std::string result;

    // Loop over the input string
    for (char c : to_normalise)
    {
        // If the character is in the map, replace it
        if (char_map.count(c))
        {
            result += std::tolower(char_map[c]);
            continue;
        }
        else if (std::isalpha(c))
        {
            // If the character is alphabetic, add it to the output string in lowercase
            result += std::tolower(c);
            continue;
        }
        else if (std::ispunct(c) || std::isemoji(c))
        {
            result += ' ';
            result += c;
            result += ' ';
            continue;
        }
        else if (std::isspace(c))
        {
            result += c;
            continue;
        }
    }

    return result;
}

The idea is to split punctuation and emojis off as separate words. It seemed a pretty simple task, so I wrote this function:

namespace std
{
    bool is_emoji(char c) {
        int codepoint = static_cast<unsigned char>(c);  
        return codepoint >= 0x1F600 && codepoint <= 0x1F64F;
    }
}

However, it gives me something like 130 for a smiley face, which is totally outside the range of Unicode emojis. Sometime later I came up with this, which I thought was a) much simpler and b) should work correctly:

namespace std
{
    bool isemoji(char c)
    {
        return (c >= 0x1F600 && c <= 0x1F64F);
    }
}

Nevertheless, it failed. With a little testing, I found out that emojis take 4 bytes (a multibyte character?), so now I have no idea what to do, except maybe use some per-byte operations. But I don't know how.

Some explanation of how a single character can take up a different number of `char`s depending on what it is would be great too.
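Edit: following the comments, here is the kind of per-byte decoding I think is meant; a minimal sketch that assumes the input is well-formed UTF-8 (`decode_utf8` and `is_emoji_codepoint` are just names I made up, and the range covers only the Emoticons block U+1F600..U+1F64F, not every emoji):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Decode the UTF-8 sequence starting at byte index i of s into a Unicode
// code point; len receives the number of bytes consumed (1 to 4).
// No validation: assumes s is well-formed UTF-8.
std::uint32_t decode_utf8(const std::string& s, std::size_t i, std::size_t& len)
{
    unsigned char b0 = static_cast<unsigned char>(s[i]);
    if (b0 < 0x80)                { len = 1; return b0; }       // 0xxxxxxx: ASCII
    std::uint32_t cp = 0;
    if      ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }  // 110xxxxx: 2-byte lead
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }  // 1110xxxx: 3-byte lead
    else                          { len = 4; cp = b0 & 0x07; }  // 11110xxx: 4-byte lead
    for (std::size_t k = 1; k < len; ++k)                       // 10xxxxxx continuation bytes
        cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
    return cp;
}

// Only the Emoticons block; real emoji span many more ranges.
bool is_emoji_codepoint(std::uint32_t cp)
{
    return cp >= 0x1F600 && cp <= 0x1F64F;
}
```

The normalisation loop would then advance by `len` bytes per iteration instead of one `char` at a time, and test the decoded code point rather than a single byte.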

Hudson
SL07
    `char` is (more than likely) an 8 bit data type. It will never have a value in the range of `[0x1F600, 0x1F64F]`. C++ doesn't really do unicode, I suggest getting yourself a library that does like ICU. – NathanOliver Feb 27 '23 at 19:56
    `char16_t` - type for UTF-16 character representation, required to be large enough to represent any UTF-16 code unit (16 bits). `char32_t` - type for UTF-32 character representation, required to be large enough to represent any UTF-32 code unit (32 bits). But even with all that, Nathan's correct that *out of the box* standard C++ doesn't really do Unicode. ICU will provide much better Unicode support. – Eljay Feb 27 '23 at 20:00
  • @NathanOliver but if I just `std::cout << textfile << std::endl;` it displays emojis just fine. – SL07 Feb 27 '23 at 20:01
  • @SL07 An individual `char` can't be an emoji, but multiple in a row can be. That's why you can print them out from a string. It's sort of (but not) like how `:` and `D` on their own are punctuation and a letter, but `:D` is a smiley face. – Kevin Feb 27 '23 at 20:06
  • While you can do that, the emoji isn't using just one element of the string to store its value. print each character separately like `for (auto e : my_string) std::cout << e;` then the emoji will go away and you will see its constituent parts. – NathanOliver Feb 27 '23 at 20:06
  • @NathanOliver You'll need a separator or something between each one to see them separately (or print each one as an integer). Printing a string all at once is equivalent to printing each element individually. – Kevin Feb 27 '23 at 20:09
  • Ok, got it... I just saw that when debugging it seems to make 4 chars of some weird value, not just 1... – SL07 Feb 27 '23 at 20:09
  • Also, you aren't allowed to define your own functions inside of the `std` namespace. You have to put it somewhere else – Kevin Feb 27 '23 at 20:15
    Does https://stackoverflow.com/questions/68914658/splitting-a-string-that-contains-emojis-and-latin-letters/68915538#68915538 answer your question? Well, different image, but I have a function `is_emoji` there. Wait a sec, `char` can never be an emoji. What _character encoding_ are you operating on? – KamilCuk Mar 13 '23 at 20:32

0 Answers