How can I get a substring of a std::wstring which includes some non-ASCII characters?

The following code does not output anything:
(The text is an Arabic word of 4 characters, each of which takes two bytes, followed by a space and the word "hello".)

#include <iostream>
#include <string>

using namespace std;

int main()
{
    wstring s = L"سلام hello";
    wcout << s.substr(0,3) << endl;
    wcout << s.substr(4,5) << endl;

    return 0;
}
Zig Razor
MBZ
  • The second should at least print " hell", and does on Coliru. The first might not be printable on the console you're supposedly using. – chris Aug 19 '13 at 22:06
  • yeah, that's the strange part. I'm getting nothing. – MBZ Aug 19 '13 at 22:09
  • What OS are you running this code on? – Matteo Italia Aug 19 '13 at 22:14
  • AFAIK the console has limited support for Unicode (due to a mix of CRT weirdness and limits of the selection of fonts available for the console), but YMMV. – Matteo Italia Aug 19 '13 at 22:20
  • In particular, first set up the whole thing correctly to print Unicode on the console (see http://stackoverflow.com/questions/2492077/output-unicode-strings-in-windows-console-app and http://blogs.msdn.com/b/michkap/archive/2008/03/18/8306597.aspx, and http://blog.wolffmyren.com/2009/02/26/necessary-criteria-for-fonts-to-be-available-in-a-command-window/ if the default font does not have the glyphs you need), *then* do your experiments with substrings and whatever. – Matteo Italia Aug 19 '13 at 22:24
  • This question is not about UTF-8. – Adrian McCarthy Aug 19 '13 at 23:20
  • Have you used the debugger? Your question is worded as though the problem is getting the substring, and all the comments are saying it might just be writing to console that's the issue. By putting the substrings into local wstrings, you should be able to establish which is the issue and edit the question accordingly. – Kate Gregory Aug 19 '13 at 23:24
  • If anyone is looking for how to split a wstring, see https://stackoverflow.com/questions/36812132/splitting-stdwstring-into-stdvector – yu yang Jian Apr 20 '21 at 13:31
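
Kate Gregory's suggestion can be sketched without any console at all: store the substrings in local variables and compare them with the expected code points. The helper name `check_substrings` is hypothetical, and the Arabic letters are written as `\u` escapes so the check does not depend on the source file encoding.

```cpp
#include <string>

// Hypothetical helper: returns true when the two substr calls from the
// question yield the expected wide strings. No console output is involved.
bool check_substrings()
{
    // The question's string, spelled with universal character names:
    // seen, lam, alef, meem, then " hello".
    const std::wstring s = L"\u0633\u0644\u0627\u0645 hello";

    const std::wstring first  = s.substr(0, 3); // first three Arabic letters
    const std::wstring second = s.substr(4, 5); // the space plus "hell"

    return s.size() == 10
        && first  == L"\u0633\u0644\u0627"
        && second == L" hell";
}
```

If this returns true while `wcout` still prints nothing, the problem most likely lies in how the console output is set up rather than in `substr`.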

1 Answer

This should work (live on Coliru):

#include <algorithm>  // std::copy
#include <iostream>
#include <iterator>   // std::back_inserter
#include <string>
#include <boost/regex/pending/unicode_iterator.hpp>

using namespace std;

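// Converts a sequence of code points (e.g. a wstring) to a UTF-8 encoded
// std::string. Assumes each element of `in` is a whole code point.
template <typename C>
std::string to_utf8(C const& in)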
{
    std::string result;
    auto out = std::back_inserter(result);
    auto utf8out = boost::utf8_output_iterator<decltype(out)>(out);

    std::copy(begin(in), end(in), utf8out);
    return result;
}

int main()
{
    wstring s = L"سلام hello";

    auto first  = s.substr(0,3);
    auto second = s.substr(4,5);

    cout << to_utf8(first)  << endl;
    cout << to_utf8(second) << endl;
}

Prints

سلا
 hell
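
If Boost is not available, the same idea (convert to UTF-8, then write to the narrow `cout`) can be sketched by hand. This is a minimal illustration, not the answer's code: the name `to_utf8_manual` is hypothetical, and it assumes every `wchar_t` holds a complete code point, which is true for the Arabic and ASCII text here but not for UTF-16 surrogate pairs.

```cpp
#include <string>

// Hypothetical Boost-free variant of to_utf8. Assumes every wchar_t in the
// input is a complete Unicode code point (no UTF-16 surrogate pairs).
std::string to_utf8_manual(const std::wstring& in)
{
    std::string out;
    for (wchar_t wc : in) {
        const unsigned long cp = static_cast<unsigned long>(wc);
        if (cp < 0x80) {            // 1 byte:  0xxxxxxx
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {    // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {  // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                    // 4 bytes: 11110xxx followed by 3 x 10xxxxxx
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Each Arabic letter in the example falls in the two-byte range, which matches the question's remark that each character takes two bytes.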

Frankly, though, I think your substring calls are making weird assumptions. Let me suggest a fix for that in a minute.
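
No follow-up fix appears in the answer as captured here. As one possible reading, offered only as an assumption: `substr(pos, count)` counts `wchar_t` elements, not bytes, so `substr(0,3)` drops the last Arabic letter and `substr(4,5)` starts at the space and stops before the final "o". A sketch that avoids hard-coded offsets by splitting at the first space (the helper name `split_at_space` is hypothetical):

```cpp
#include <string>
#include <utility>

// Hypothetical helper: splits at the first space instead of hard-coding
// offsets. Returns {text before the space, text after it}; if there is no
// space, the whole input is returned first and the second part is empty.
std::pair<std::wstring, std::wstring> split_at_space(const std::wstring& s)
{
    const std::wstring::size_type pos = s.find(L' ');
    if (pos == std::wstring::npos)
        return {s, std::wstring()};
    return {s.substr(0, pos), s.substr(pos + 1)};
}
```

For the string in the question this yields the full four-letter Arabic word and "hello", the same as `s.substr(0,4)` and `s.substr(5,5)`.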

sehe