
I'm trying to take a string of emojis and split them into a vector of each emoji

Given the string:

std::string emojis = "";

I'm trying to get:

std::vector<std::string> splitted_emojis = {"", "", "", "", "", "", "", ""};

Edit

I've tried to do:

std::string emojis = "";
std::vector<std::string> splitted_emojis;
size_t pos = 0;
std::string token;
while ((pos = emojis.find("")) != std::string::npos)
{
    token = emojis.substr(0, pos);
    splitted_emojis.push_back(token);
    emojis.erase(0, pos);
}

But after a couple of seconds the program aborts with: terminate called after throwing an instance of 'std::bad_alloc'.
When trying to check how many emojis are in a string using:

std::string emojis = "";
std::cout << emojis.size() << std::endl; // returns 32

it returns a bigger number than the emoji count, which I assume is the size of the Unicode data in bytes. I don't know much about Unicode, but I'm trying to figure out how to detect where the data of each emoji begins and ends, so that I can split the string into individual emojis.

DaStrangeBoi
  • How much do you know about unicode and character encodings? – user4581301 Aug 06 '20 at 03:16
  • And what did the debugger say? – kesarling He-Him Aug 06 '20 at 04:17
  • @d4rk4ng31 I use VSCode without a debugger setup – DaStrangeBoi Aug 06 '20 at 04:35
  • @d4rk4ng31 VSCode is a fully capable editor (and *the* most popular according to the SO developer survey). You just need to set it up with a debugger. – eesiraed Aug 06 '20 at 05:42
  • Use a Unicode library instead. The C++ stdlib doesn't have good Unicode support and can't know the UTF-8 character boundaries – phuclv Aug 06 '20 at 05:43
  • "I don't know too much about unicode data", so read the classic [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/) – Botje Aug 06 '20 at 07:21
  • @Botje I'll make sure to give it a read, thanks! – DaStrangeBoi Aug 06 '20 at 12:35

1 Answer


I would definitely recommend that you use a library with better Unicode support (all large frameworks have one), but in a pinch you can get by with knowing that the UTF-8 encoding spreads Unicode code points over one to four bytes, and that the leading bits of the first byte determine how many bytes a code point is made up of.
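To make the byte layout concrete, here is a minimal sketch (not part of the boost code) that counts code points by skipping continuation bytes, which always have the bit pattern 10xxxxxx. The name count_code_points is made up for illustration, and the input is assumed to be valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a string that is assumed
// to be valid UTF-8. Every byte that is NOT a continuation byte
// (10xxxxxx) starts a new code point.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For a string holding eight 4-byte emojis, size() reports 32 while this count gives 8, which matches the numbers in the question.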

I stole a function from boost. The split_by_codepoint function uses an iterator over the input string and constructs a new string using the first N bytes (where N is determined by the byte count function) and pushes it to the ret vector.

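#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Taken from boost internals
inline unsigned utf8_byte_count(std::uint8_t c)
{
  // if the most significant bit with a zero in it is in position
  // 8-N then there are N bytes in this UTF-8 sequence:
  std::uint8_t mask = 0x80u;
  unsigned result = 0;
  while(c & mask)
  {
    ++result;
    mask >>= 1;
  }
  // Clamp to the valid range 1..4 (continuation and invalid bytes included)
  return (result == 0) ? 1 : ((result > 4) ? 4 : result);
}

// Splits a UTF-8 string into one std::string per code point
std::vector<std::string> split_by_codepoint(const std::string& input) {
  std::vector<std::string> ret;
  auto it = input.cbegin();
  while (it != input.cend()) {
    // Never read past the end if the last sequence is truncated
    const std::ptrdiff_t remaining = input.cend() - it;
    const std::ptrdiff_t count = std::min<std::ptrdiff_t>(
        utf8_byte_count(static_cast<std::uint8_t>(*it)), remaining);
    ret.emplace_back(it, it + count);
    it += count;
  }
  return ret;
}

int main() {
    // Since C++20, u8 literals are char8_t and no longer initialize a
    // std::string; use a plain literal in a UTF-8 source file there.
    std::string emojis = u8"";
    auto split = split_by_codepoint(emojis);
    std::cout << split.size() << std::endl;
}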

Note that this function simply splits a string into UTF-8 strings containing one code point each. Determining if the character is an emoji is left as an exercise: UTF-8-decode any 4-byte characters and see if they are in the proper range.
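As a starting point for that exercise, here is a rough sketch of decoding a single UTF-8 sequence into a code point and testing it against a range. Both decode_utf8 and looks_like_emoji are hypothetical names, and the range U+1F300 to U+1FAFF is only an approximation I picked for illustration: not every code point in it is an emoji, and some emojis (for example U+2764, the heart) live outside it and take only 3 bytes.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes exactly one UTF-8 sequence (1 to 4 bytes).
// Returns U+FFFD (the replacement character) if the length or the
// continuation bytes do not match the lead byte.
char32_t decode_utf8(const std::string& s) {
    if (s.empty()) return 0xFFFD;
    const unsigned char b0 = static_cast<unsigned char>(s[0]);
    std::size_t len = 0;
    char32_t cp = 0;
    if (b0 < 0x80)                { len = 1; cp = b0; }          // 0xxxxxxx
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }   // 110xxxxx
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }   // 1110xxxx
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }   // 11110xxx
    else return 0xFFFD;
    if (s.size() != len) return 0xFFFD;
    for (std::size_t i = 1; i < len; ++i) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) != 0x80) return 0xFFFD;  // not a continuation byte
        cp = (cp << 6) | (b & 0x3F);            // append the low 6 bits
    }
    return cp;
}

// Assumption: a coarse range check, not the official emoji property.
bool looks_like_emoji(char32_t cp) {
    return cp >= 0x1F300 && cp <= 0x1FAFF;
}
```

Each element returned by split_by_codepoint can be passed to decode_utf8 directly. For a precise answer you would check the Unicode emoji data, which a Unicode library does for you.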

Botje
  • Can we use something like a wchar for emojis? :P – Tilak Madichetti Aug 06 '20 at 08:16
  • Only if your platform has 32bit wchar_t and you can live with the wasted memory. Qt's QString uses UTF-32 internally, for example. Microsoft has 16bit wchar_t so it still needs surrogate characters to represent emoji. – Botje Aug 06 '20 at 08:19
  • Alternatively, you can use a [UTF-8 iterator adaptor](https://www.boost.org/doc/libs/1_73_0/libs/regex/doc/html/boost_regex/ref/internals/uni_iter.html) – Botje Aug 06 '20 at 08:29
  • Also note that it is not true that 1 Unicode code point = 1 character, especially with emojis. There are grapheme clusters that span several code points. E.g. `👨‍❤️‍👩` is 5 Unicode characters (a male face, a heart and a female face, joined by ZWJs), or flags, consisting of `U+1F3F4` Waving Flag, 2-5 CLDR characters indicating the country or region, and `U+E007F` – king_nak Aug 06 '20 at 09:09
  • Even though some emojis are 5 Unicode characters, for what I am doing this works perfectly – DaStrangeBoi Aug 06 '20 at 12:38
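For readers who do need the ZWJ sequences from king_nak's comment kept together, here is a naive sketch that glues code points joined by U+200D (zero width joiner) and attaches U+FE0F (variation selector-16) to the preceding code point. The names seq_len and split_joined_emoji are made up for illustration. This is not the full grapheme cluster algorithm from the Unicode standard: flags, tag sequences, skin tone modifiers and combining marks are not handled, so treat it as an assumption-laden starting point and prefer a Unicode library for real use.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: length of the UTF-8 sequence started by lead
// byte c. Anything unexpected is treated as a single byte.
static std::size_t seq_len(unsigned char c) {
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 1;
}

// Splits UTF-8 text into code points, but keeps ZWJ-joined sequences
// (and a trailing variation selector-16) in the same element.
std::vector<std::string> split_joined_emoji(const std::string& input) {
    static const std::string zwj  = "\xE2\x80\x8D";  // U+200D
    static const std::string vs16 = "\xEF\xB8\x8F";  // U+FE0F
    std::vector<std::string> out;
    bool glue_next = false;  // true right after a ZWJ
    for (std::size_t i = 0; i < input.size();) {
        std::size_t n = seq_len(static_cast<unsigned char>(input[i]));
        if (n > input.size() - i) n = input.size() - i;  // truncated tail
        const std::string cp = input.substr(i, n);
        const bool attach =
            !out.empty() && (glue_next || cp == zwj || cp == vs16);
        if (attach) {
            out.back() += cp;
        } else {
            out.push_back(cp);
        }
        glue_next = (cp == zwj);
        i += n;
    }
    return out;
}
```

With this, the man, heart, woman sequence from the comment stays in one element, while a plain emoji following it becomes its own element.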