How to split a string by emojis in C++

Question

I'm trying to take a string of emojis and split it into a vector with one element per emoji.

Given the string:

std::string emojis = "";

I'm trying to get:

std::vector<std::string> splitted_emojis = {"", "", "", "", "", "", "", ""};

Edit

I've tried to do:

std::string emojis = "";
std::vector<std::string> splitted_emojis;
size_t pos = 0;
std::string token;
while ((pos = emojis.find("")) != std::string::npos)
{
    token = emojis.substr(0, pos);
    splitted_emojis.push_back(token);
    emojis.erase(0, pos);
}

But after a couple of seconds it terminates with "terminate called after throwing an instance of 'std::bad_alloc'".
When trying to check how many emojis are in a string using:

std::string emojis = "";
std::cout << emojis.size() << std::endl; // returns 32

it returns a bigger number, which I assume is the Unicode data. I don't know too much about Unicode, but I'm trying to figure out how to detect where the data of an emoji begins and ends, so that I can split the string into individual emojis.

@d4rk4ng31 VSCode is a fully capable editor (and *the* most popular according to the SO developer survey). You just need to set it up with a debugger. — eesiraed, Aug 06 '20 at 05:42
Use a Unicode library instead. The C++ standard library doesn't have good Unicode support and can't know the UTF-8 character boundaries — phuclv, Aug 06 '20 at 05:43
"I don't know too much about unicode data", so read the classic [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/) — Botje, Aug 06 '20 at 07:21

Botje · Accepted Answer · 2020-08-06T08:11:43.907

I would definitely recommend that you use a library with better Unicode support (all large frameworks do), but in a pinch you can get by with knowing that the UTF-8 encoding spreads Unicode characters over multiple bytes, and that the leading bits of the first byte determine how many bytes a character is made up of.

I stole a function from Boost. The split_by_codepoint function iterates over the input string, constructs a new string from the next N bytes (where N is determined by utf8_byte_count) and pushes it to the ret vector.

#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Taken from boost internals
inline unsigned utf8_byte_count(uint8_t c)
{
  // if the most significant bit with a zero in it is in position
  // 8-N then there are N bytes in this UTF-8 sequence:
  uint8_t mask = 0x80u;
  unsigned result = 0;
  while(c & mask)
  {
    ++result;
    mask >>= 1;
  }
  return (result == 0) ? 1 : ((result > 4) ? 4 : result);
}

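std::vector<std::string> split_by_codepoint(const std::string& input) {
  std::vector<std::string> ret;
  auto it = input.cbegin();
  while (it != input.cend()) {
    // Number of bytes the lead byte announces for this code point
    std::string::difference_type count = utf8_byte_count(*it);
    // Clamp so a truncated trailing sequence cannot read past the end
    if (count > input.cend() - it) count = input.cend() - it;
    ret.emplace_back(it, it + count);
    it += count;
  }
  return ret;
}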

int main() {
    // Note: since C++20, u8"" literals are char8_t and no longer initialize a std::string directly.
    std::string emojis = u8"";
    auto split = split_by_codepoint(emojis);
    std::cout << split.size() << std::endl;
}

Note that this function simply splits a string into UTF-8 strings containing one code point each. Determining if the character is an emoji is left as an exercise: UTF-8-decode any 4-byte characters and see if they are in the proper range.
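As a rough sketch of that exercise, the following decodes one of the strings returned by split_by_codepoint and tests it against a range. The names decode_utf8 and looks_like_emoji are made up for illustration, and the range U+1F300..U+1FAFF is an assumption: it covers most pictographic emoji blocks, but it also includes some non-emoji symbols and misses emoji outside it (for example the 3-byte heart U+2764 and the regional indicators used for flags). The decoder is simplified and does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <string>

// Decode a string holding exactly one UTF-8 sequence into its code point.
// Returns U+FFFD (replacement character) for empty or malformed input.
// Simplified: overlong encodings and surrogates are not rejected.
inline char32_t decode_utf8(const std::string& s) {
  if (s.empty()) return 0xFFFD;
  const auto b0 = static_cast<unsigned char>(s[0]);
  char32_t cp = 0;
  std::size_t len = 0;
  if (b0 < 0x80)                { cp = b0;        len = 1; }  // 0xxxxxxx
  else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }  // 110xxxxx
  else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }  // 1110xxxx
  else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }  // 11110xxx
  else return 0xFFFD;                                         // stray continuation or invalid lead byte
  if (s.size() != len) return 0xFFFD;
  for (std::size_t i = 1; i < len; ++i) {
    const auto b = static_cast<unsigned char>(s[i]);
    if ((b & 0xC0) != 0x80) return 0xFFFD;  // every following byte must be 10xxxxxx
    cp = (cp << 6) | (b & 0x3F);            // append its 6 payload bits
  }
  return cp;
}

// Assumed range, for illustration only: not an authoritative emoji test.
inline bool looks_like_emoji(char32_t cp) {
  return cp >= 0x1F300 && cp <= 0x1FAFF;
}
```

Applied to each element of the vector returned by split_by_codepoint, looks_like_emoji(decode_utf8(element)) gives a first approximation. The Unicode emoji data files are the authoritative source for which code points count as emoji.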

Only if your platform has 32bit wchar_t and you can live with the wasted memory. Qt's QString uses UTF-32 internally, for example. Microsoft has 16bit wchar_t so it still needs surrogate characters to represent emoji. — Botje, Aug 06 '20 at 08:19
Alternatively, you can use a [UTF-8 iterator adaptor](https://www.boost.org/doc/libs/1_73_0/libs/regex/doc/html/boost_regex/ref/internals/uni_iter.html) — Botje, Aug 06 '20 at 08:29
Also note that it is not true that 1 Unicode code point = 1 character, especially with emojis. There are grapheme clusters that span several code points. E.g. `‍❤️‍` is 5 code points (a male face, a heart and a female face, joined by ZWJs), or flags, consisting of `U+1F3F4` Waving Flag, 2-5 CLDR characters indicating the country or region, and ` U+E007F` — king_nak, Aug 06 '20 at 09:09
Even though some emojis are 5 Unicode characters, for what I am doing this works perfectly — DaStrangeBoi, Aug 06 '20 at 12:38
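
If ZWJ sequences should stay together, one rough approximation is to merge the output of split_by_codepoint afterwards. The sketch below (merge_zwj_sequences is a hypothetical helper, not part of the answer) glues a code point to the previous element when it is a ZWJ, a variation selector U+FE0F, or directly follows a ZWJ. It does not handle flags, skin-tone modifiers or the rest of the grapheme cluster rules in UAX #29. A Unicode library such as ICU is the more robust option.

```cpp
#include <string>
#include <vector>

// Rough approximation only: merges ZWJ-joined code points into one element.
// Input: one UTF-8 encoded code point per string (e.g. from split_by_codepoint).
inline std::vector<std::string> merge_zwj_sequences(const std::vector<std::string>& code_points) {
  const std::string zwj  = "\xE2\x80\x8D";  // U+200D ZERO WIDTH JOINER
  const std::string vs16 = "\xEF\xB8\x8F";  // U+FE0F VARIATION SELECTOR-16
  std::vector<std::string> clusters;
  bool join_next = false;  // true when the previous code point was a ZWJ
  for (const std::string& cp : code_points) {
    const bool is_joiner = (cp == zwj);
    const bool attaches = is_joiner || cp == vs16 || join_next;
    if (attaches && !clusters.empty()) {
      clusters.back() += cp;    // extend the current cluster
    } else {
      clusters.push_back(cp);   // start a new cluster
    }
    join_next = is_joiner;
  }
  return clusters;
}
```

Usage would be along the lines of merge_zwj_sequences(split_by_codepoint(emojis)).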
