
I have a stream that gives me one word per loop iteration as a std::string, but ideally each word should be a std::wstring. So after I obtain the string, I convert it to a std::wstring and feed it into a std::wstringstream. Finally, after all words from the stream have been processed, I convert the std::wstringstream into a std::wstring and search it for the required term (itself a std::wstring). This is my code:

while (stream)
{
    std::string word = stream->getWord();
    boost::trim(word);    

    std::wstring longWord(word.length(), L' '); // Make room for characters
    std::copy(word.begin(), word.end(), longWord.begin());

    fMyWideCharStream << longWord;
    stream->next();
}

std::wstring fContentString = fMyWideCharStream.str();

size_t nPos = fContentString.find(fSearchString, 0); //fSearchString is std::wstring

while(nPos != std::wstring::npos)
    {
        qDebug() << "Pos: " << nPos << endl;
        nPos = fContentString.find(fSearchString, nPos+1);
    }

I have this string: Passive Aggressive Dealing With Passive Aggression, Lost Happiness & Disconnection Copyright © 2014, where © is a non-ASCII character. In the std::string it takes up two positions (bytes). In a std::wstring it should take one, which is what I want. However, when I search with fSearchString set to L"2014", I still get a position of 96, whereas it should be 95 now that the string is a std::wstring.

Any idea what I should do to fix this?

SexyBeast

1 Answer


The original string is not ASCII-only: it contains the multibyte character '©', so converting from string to wstring element by element is wrong. Each byte of the multibyte sequence becomes its own wchar_t instead of the sequence being decoded into one character. Therefore both

std::wstring longWord(word.length(), L' '); // Make room for characters
std::copy(word.begin(), word.end(), longWord.begin());

and

std::wstring longWord(word.begin(), word.end());

do not work for a string containing multibyte characters. To convert a multibyte string to a wstring properly on Windows, you could use mbstowcs(), a standard C function that decodes the bytes according to the current C locale: http://www.cplusplus.com/reference/cstdlib/mbstowcs/
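To see why the element-wise copy is off by one, here is a minimal sketch. The helper names `widenBytewise` and `bytewiseLengthOfCopyrightSign` are made up for illustration, and the input is assumed to be UTF-8, where © is encoded as the two bytes 0xC2 0xA9.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper mirroring the question's approach: every byte of
// the input becomes one wchar_t, and no decoding takes place.
std::wstring widenBytewise(const std::string& word)
{
    return std::wstring(word.begin(), word.end());
}

// Number of wide elements produced for a UTF-8 encoded copyright sign
// (assumed input bytes: 0xC2 0xA9).
std::size_t bytewiseLengthOfCopyrightSign()
{
    const std::string copyright = "\xC2\xA9";
    return widenBytewise(copyright).size();
}
```

Because the two bytes of © stay two separate elements, everything after it in the wide string sits one position too far to the right, which matches the 96 versus 95 discrepancy in the question.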

In a platform-independent way, with C++11 (compile option for clang: -std=c++11), you can do this: https://stackoverflow.com/a/14809553/1915854 , https://stackoverflow.com/a/18597384/1915854

Example if you need characters beyond what a single wchar_t can store:

std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
std::wstring longWord = converter.from_bytes(word);

If you don't need characters beyond what a single wchar_t can store:

std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
std::wstring longWord = converter.from_bytes(word);

Necessary includes:

#include <locale>
#include <codecvt>
#include <string>
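Putting the pieces together, a minimal sketch of the conversion followed by the search could look like this. The function names `utf8ToWide` and `findInUtf8` are hypothetical, and the input is assumed to be valid UTF-8. Note that `std::wstring_convert` and `<codecvt>` are deprecated since C++17, so newer compilers may emit warnings.

```cpp
#include <codecvt>
#include <cstddef>
#include <locale>
#include <string>

// Decodes a UTF-8 byte string into a wide string (assumes valid UTF-8;
// from_bytes throws std::range_error otherwise).
std::wstring utf8ToWide(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
    return converter.from_bytes(utf8);
}

// Returns the index, counted in wide characters, of the first occurrence
// of term in the decoded text, or std::wstring::npos if it is absent.
std::size_t findInUtf8(const std::string& utf8Text, const std::wstring& term)
{
    return utf8ToWide(utf8Text).find(term);
}
```

For the UTF-8 input "Copyright \xC2\xA9 2014", `findInUtf8` with L"2014" gives 12, since © now counts as a single character. In the question's loop, the result of `utf8ToWide(word)` would take the place of `longWord`.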

Before C++11, Boost appears to have offered other options.

Serge Rogatch
  • I assume U16 and U8 refer to `std::wstring` and `std::string` respectively? What compiler options do I need to provide? – SexyBeast Jul 01 '15 at 19:22
  • @Cupidvogel, I've updated my answer with code example particular to your case. Please, try. – Serge Rogatch Jul 01 '15 at 19:34
  • So this will ensure that the `wstringstream` truly consists of long words. Will converting it to `std::wstring` through the `str()` method keep it safe? – SexyBeast Jul 01 '15 at 19:39
  • What do you mean by "keeping it safe"? I expect that `str()` method of `wstringstream` returns a `wstring` containing what was in the `wstringstream`. – Serge Rogatch Jul 01 '15 at 19:43
  • Yep, yep, I mean that only. Gonna try it now. – SexyBeast Jul 01 '15 at 19:45
  • Works like a charm! Thanks! One more question: I haven't had the opportunity to test it on Windows yet. Will this work on Windows as well for sure? – SexyBeast Jul 01 '15 at 21:45
  • @Cupidvogel, on Windows I am getting e.g. a std::range_error exception when the `string` is not UTF-8 but in some single-byte code page, e.g. Windows-1251 (Cyrillic). – Serge Rogatch Jul 02 '15 at 06:06
  • In my case, the words obtained are guaranteed to be `std::string`. Will there be a problem, then? – SexyBeast Jul 02 '15 at 06:23
  • If the `std::string` contains UTF-8 text, there should be no problem. But if the string is encoded in a single-byte code page, you can get a `std::range_error` exception. I don't know for sure, so feel free to ask another question if something is still not clear to you. – Serge Rogatch Jul 02 '15 at 07:33
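If the encoding of the incoming words is not guaranteed, the `std::range_error` mentioned in the comments can be caught rather than left to propagate. The following is only a sketch: `tryUtf8ToWide` is a hypothetical helper, and what to do when it returns false (skip the word, or decode with another code page) depends on the application.

```cpp
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Attempts to decode bytes as UTF-8 into out. Returns false, leaving out
// untouched, if the bytes are not valid UTF-8.
bool tryUtf8ToWide(const std::string& bytes, std::wstring& out)
{
    try {
        std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
        out = converter.from_bytes(bytes);
        return true;
    } catch (const std::range_error&) {
        // Thrown by from_bytes when the input cannot be converted,
        // e.g. Windows-1251 text passed where UTF-8 is expected.
        return false;
    }
}
```

Pure ASCII input is always valid UTF-8, so only bytes above 0x7F that do not form valid UTF-8 sequences trigger the failure path.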