
I want to remove emojis from a JSON Telegram bot update that is parsed with Boost.PropertyTree.

I tried to use the regex pattern from this answer and a few others, but I'm not sure how to get them to work in C++ (the code below crashes): https://stackoverflow.com/a/24674266/2212021

"message":{
   "message_id":123,
   "from":{
      "id":12345,
      "first_name":"name",
      "username":"username"
   },
   "chat":{
      "id":12345,
      "first_name":"name",
      "username":"username",
      "type":"private"
   },
   "date":1478144459,
   "text":"this is \ud83d\udca9 a sentence"
}
BOOST_FOREACH(const boost::property_tree::ptree::value_type& child, jsontree.get_child("result"))
{

        std::string message(child.second.get<std::string>("message.text").c_str());

        boost::regex exp("/[\u{1F600}-\u{1F6FF}]/");
        std::string message_clean = regex_replace(message, exp, "");

        ...
}

Exception thrown at 0x00007FF87C1C7788 in CrysisWarsDedicatedServer.exe: Microsoft C++ exception: boost::exception_detail::clone_impl > at memory location 0x000000001003F138.
Unhandled exception at 0x00007FF87C1C7788 in CrysisWarsDedicatedServer.exe: Microsoft C++ exception: boost::exception_detail::clone_impl > at memory location 0x000000001003F138.

JFB
  • This question could use some more details: I'm presuming `regex` is actually `boost::regex`. What do you mean when you can't get it to work? Does it compile? Does it crash while running? Or does `message_clean` not give you the desired result (e.g. all the emojis are still in the string)? Have you tried using a simpler regex (e.g. remove all numbers) to ensure `regex_replace` is working correctly? What version of `boost` are you using? – Tas Nov 03 '16 at 22:59
  • Apologies; yes, boost::regex. The problem is that it crashes and the version is boost 1.55 – JFB Nov 04 '16 at 00:39
  • How does it crash, where? – sehe Nov 04 '16 at 10:16
  • Had a go at attaching to the process. edited main post – JFB Nov 04 '16 at 12:22

1 Answer


The first problem is using .c_str() on a byte array with arbitrary text encoding. There's no need, so don't do it.
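One concrete way the .c_str() round trip can go wrong is an embedded NUL byte: the const char* constructor stops at the first one. The sketch below only illustrates that point; the helper names are hypothetical and not part of the question's code.

```cpp
#include <string>

// Copies a string the way the question does: through a C-string pointer.
// The const char* constructor stops at the first NUL byte it sees.
inline std::string copy_via_c_str(const std::string& s) {
    return std::string(s.c_str());
}

// Plain copy: keeps every byte, including embedded NULs.
inline std::string copy_directly(const std::string& s) {
    return s;
}
```

For a five-byte input such as std::string("ab\0cd", 5), the first helper keeps two bytes and the second keeps all five, so using the value returned by get<std::string>() directly is the safer choice.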

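Next, '\u{1F600}' is not a valid C++ character escape: before C++23, '\u' must be followed by exactly four hex digits. Did you mean '\\u'? Also note that the surrounding slashes are JavaScript regex-literal delimiters; inside a C++ string they become literal '/' characters that the pattern would have to match.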
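To make the escape rules concrete, here is a small standard-library sketch. It shows what "\\u" and a \U universal-character-name produce in C++ source, and how the JSON surrogate pair \ud83d\udca9 from the question maps to one code point. The function and constant names are hypothetical. Note that the result, U+1F4A9, lies outside the U+1F600 to U+1F6FF range used in the question, so that range would not match the sample text even with correct escaping.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Combines a UTF-16 surrogate pair (written in JSON as "\ud83d\udca9")
// into a single Unicode code point. Assumes hi is in [0xD800, 0xDBFF]
// and lo is in [0xDC00, 0xDFFF]; no validation is done here.
constexpr std::uint32_t combine_surrogates(std::uint32_t hi, std::uint32_t lo) {
    return 0x10000u + ((hi - 0xD800u) << 10) + (lo - 0xDC00u);
}

// "\\u" in C++ source is two characters: a backslash followed by 'u'.
inline std::string escaped_backslash_u() { return "\\u"; }

// A \U universal-character-name takes exactly eight hex digits.
// In a u8 literal it is stored as UTF-8: four bytes plus the terminator.
constexpr std::size_t pile_of_poo_utf8_bytes = sizeof(u8"\U0001F4A9") - 1;
```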

Finally, make sure Boost.Regex is compiled with Unicode (ICU) support and use the Unicode-aware functions, such as boost::u32regex and boost::u32regex_replace.

After spending some time with those documentation pages, I came up with:

Live On Wandbox

//#define BOOST_HAS_ICU
#include <boost/property_tree/json_parser.hpp>
#include <boost/regex.hpp>
#include <boost/regex/icu.hpp>
#include <iostream>
#include <sstream>
#include <string>

// Converts an ICU string back to UTF-8 (defined below).
std::string asUtf8(icu::UnicodeString const& ustr);

std::string sample = R"(
{
    "message":{
       "message_id":123,
       "from":{
          "id":12345,
          "first_name":"name",
          "username":"username"
       },
       "chat":{
          "id":12345,
          "first_name":"name",
          "username":"username",
          "type":"private"
       },
       "date":1478144459,
       "text":"this is \ud83d\udca9 a sentence"
    }
}
)";

int main() {

    boost::property_tree::ptree pt;
    {
        std::istringstream iss(sample);
        read_json(iss, pt);
    }
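    // Decode the UTF-8 text into an ICU string so the regex sees code points, not bytes.
    auto umessage       = icu::UnicodeString::fromUTF8(pt.get("message.text", ""));
    // \p{So} matches the Unicode general category "Symbol, other", which covers many emoji.
    boost::u32regex exp = boost::make_u32regex("\\p{So}");

    // UnicodeString is qualified with icu:: because newer ICU releases do not import the namespace by default.
    auto clean = boost::u32regex_replace(umessage, exp, icu::UnicodeString::fromUTF8("<symbol>"));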

    std::cout << asUtf8(clean) << "\n";
}

std::string asUtf8(icu::UnicodeString const& ustr) {
    std::string r;
    {
        icu::StringByteSink<std::string> bs(&r);
        ustr.toUTF8(bs);
    }

    return r;
}

This prints:

this is <symbol> a sentence
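If rebuilding Boost.Regex with ICU is not an option, a different technique is to decode the UTF-8 by hand and drop code points in a chosen range. The following is a minimal standard-library sketch of that alternative, not the approach used above. The helper name strip_code_point_range is hypothetical, the input is assumed to be valid UTF-8, and the caller has to pick the range: emoji are spread over several Unicode blocks, so a single range such as U+1F300 to U+1FAFF is an assumption and will not catch everything that \p{So} does.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of Boost or ICU): removes every code point
// in [first, last] from a UTF-8 string. Assumes the input is valid UTF-8;
// malformed input is not handled robustly.
inline std::string strip_code_point_range(const std::string& utf8,
                                          std::uint32_t first,
                                          std::uint32_t last) {
    std::string out;
    out.reserve(utf8.size());
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::size_t len = 1;        // sequence length in bytes
        std::uint32_t cp = lead;    // decoded code point
        if ((lead & 0xE0u) == 0xC0u)      { len = 2; cp = lead & 0x1Fu; }
        else if ((lead & 0xF0u) == 0xE0u) { len = 3; cp = lead & 0x0Fu; }
        else if ((lead & 0xF8u) == 0xF0u) { len = 4; cp = lead & 0x07u; }

        // A sequence cut off by the end of the string is treated as one byte.
        if (i + len > utf8.size()) { len = 1; cp = lead; }

        // Fold in the low six bits of each continuation byte.
        for (std::size_t k = 1; k < len; ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3Fu);
        }
        // Keep the original bytes unless the code point is in the range.
        if (cp < first || cp > last) out.append(utf8, i, len);
        i += len;
    }
    return out;
}
```

Called on the UTF-8 form of the sample text with the range 0x1F300 to 0x1FAFF, this removes the emoji and leaves the two surrounding spaces in place. With the question's original range 0x1F600 to 0x1F6FF the text comes back unchanged, because U+1F4A9 is not in that range.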
sehe