How to search for text in wide string obtained from wstringstream

Question

I have a stream of words that gives me one word per loop iteration as a std::string, but ideally this should be a std::wstring. So after I obtain each string, I convert it to a std::wstring and write it into a std::wstringstream. Finally, after all words from the stream have been processed, I convert the std::wstringstream into a std::wstring and search it for the required term (itself a std::wstring). This is my code:

while (stream)
{
    std::string word = stream->getWord();
    boost::trim(word);    

    std::wstring longWord(word.length(), L' '); // Make room for characters
    std::copy(word.begin(), word.end(), longWord.begin());

    fMyWideCharStream << longWord;
    stream->next();
}

std::wstring fContentString = fMyWideCharStream.str();

size_t nPos = fContentString.find(fSearchString, 0); //fSearchString is std::wstring

while (nPos != std::wstring::npos)
{
    qDebug() << "Pos: " << nPos << endl;
    nPos = fContentString.find(fSearchString, nPos + 1);
}

I have this string: Passive Aggressive Dealing With Passive Aggression, Lost Happiness & Disconnection Copyright © 2014, where the © is a wide character. As std::string it takes up two positions. As std::wstring it takes 1, which is what I want. However, on trying fSearchString with a value of L"2014", I am still getting a value of 96, whereas it should be 95 since this string is now std::wstring.

Any idea what I should do to fix this?

As an aside, your two-line copy is almost equivalent to `std::wstring longWord(word.begin(), word.end());` — chris, Jul 01 '15 at 18:51
Calling `find` on a `std::wstring` will invoke that automatically, I presume? — SexyBeast, Jul 01 '15 at 18:53
@Cupidvogel `std::wstring::find` is invoked *exactly* by calling `find()` on a `std::wstring` — набиячлэвэли, Jul 01 '15 at 18:56
Yes, thought so. Then why the above comment by @ThomasMatthews? — SexyBeast, Jul 01 '15 at 18:58
Are you sure that the contents of `fContentString` match what is in your question? If I just copy your string and call `find()` it works [here](http://coliru.stacked-crooked.com/a/c78c6703f2905fbb) — NathanOliver, Jul 01 '15 at 19:05
Have you checked that `std::copy(word.begin(), word.end(), longWord.begin());` does what you think it does? Since `std::copy` is a general purpose algorithm, it seems unlikely that it will suddenly decide to invoke a multibyte-to-wide-character conversion. — rici, Jul 01 '15 at 19:05
@Cupidvogel: Sorry, I was going by your question's title on *"How to search..."*. Looks like your question is not about searching. — Thomas Matthews, Jul 01 '15 at 19:07
@NathanOliver, yes I know it will work in a direct string. The problem is how to make it work from the stream. — SexyBeast, Jul 01 '15 at 19:07
@rici, I found this here: http://stackoverflow.com/a/6691597/1469954 — SexyBeast, Jul 01 '15 at 19:08
Which compiler do you use? It is definitely wrong to convert string to wstring this way if the original string contains anything except ASCII (e.g. UTF8 characters). — Serge Rogatch, Jul 01 '15 at 19:10
@Cupidvogel: That code only widens characters. It does not do multibyte to wide character conversion. — rici, Jul 01 '15 at 19:16
By the way, why do you call `stream->next();` twice in the loop? Is it not a mistake? — Serge Rogatch, Jul 01 '15 at 20:04
No no, that is a typo here, it works fine in my actual code. Editing it.. :) — SexyBeast, Jul 01 '15 at 20:10

score 1 · Accepted Answer · edited May 23 '17 at 12:06

Because the original string is not ASCII-only (it contains the multibyte character '©'), converting from string to wstring character by character is wrong. Therefore both

std::wstring longWord(word.length(), L' '); // Make room for characters
std::copy(word.begin(), word.end(), longWord.begin());

and

std::wstring longWord(word.begin(), word.end());

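do not work for a string containing multibyte characters: each byte of the multibyte sequence is widened into its own wchar_t, so '©' still occupies two positions. To properly convert from a multibyte-character string to wstring on Windows you could use mbstowcs(), which converts according to the current C locale: http://www.cplusplus.com/reference/cstdlib/mbstowcs/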
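To make the off-by-one visible, here is a minimal sketch of the byte-wise widening on a shortened version of the string from the question. The helper name `widenBytewise` is hypothetical, and the literal assumes the input is UTF-8, where '©' is the two bytes 0xC2 0xA9.

```cpp
#include <string>

// Hypothetical helper: widens each byte into its own wchar_t, exactly like
// the std::copy approach or the iterator-range constructor. No decoding of
// multibyte sequences takes place.
std::wstring widenBytewise(const std::string& word)
{
    return std::wstring(word.begin(), word.end());
}

// Assumed UTF-8 input: "Copyright " (10 bytes), '©' (2 bytes), " 2014" (5 bytes).
const std::string kSample = "Copyright \xC2\xA9 2014";
```

With this input, `widenBytewise(kSample)` has 17 elements and `find(L"2014")` reports index 13, one more than the 12 you would get if '©' were a single wide character.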

In a platform-independent way, with C++11 (compile option for clang: -std=c++11) you can do this: https://stackoverflow.com/a/14809553/1915854 , https://stackoverflow.com/a/18597384/1915854

Example if you need characters beyond what a single wchar_t can store:

std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
std::wstring longWord = converter.from_bytes(word);

If you don't need characters beyond what a single wchar_t can store:

std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
std::wstring longWord = converter.from_bytes(word);

Necessary includes:

#include <locale>
#include <codecvt>
#include <string>
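Putting the pieces together, a sketch of the conversion plus the search loop from the question might look like the following. It assumes the words are UTF-8 encoded and uses the `std::codecvt_utf8<wchar_t>` variant; the function names `utf8ToWide` and `findAll` are hypothetical. Note that `std::wstring_convert` and `<codecvt>` are deprecated since C++17, so newer compilers may emit warnings.

```cpp
#include <codecvt>
#include <cstddef>
#include <locale>
#include <string>
#include <vector>

// Decode a UTF-8 std::string into a std::wstring.
// from_bytes throws std::range_error if the input is not valid UTF-8.
std::wstring utf8ToWide(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
    return converter.from_bytes(utf8);
}

// Collect the start index of every occurrence of `needle` in `haystack`,
// using the same find / nPos + 1 pattern as the loop in the question.
std::vector<std::size_t> findAll(const std::wstring& haystack,
                                 const std::wstring& needle)
{
    std::vector<std::size_t> positions;
    if (needle.empty())
        return positions; // an empty needle would match at every index
    for (std::size_t pos = haystack.find(needle);
         pos != std::wstring::npos;
         pos = haystack.find(needle, pos + 1))
    {
        positions.push_back(pos);
    }
    return positions;
}
```

For the UTF-8 input "Copyright © 2014", `utf8ToWide` yields 16 wide characters and `findAll(..., L"2014")` reports the single position 12, because '©' now occupies one element. In the question's loop, `longWord` would be the result of `utf8ToWide(word)` before it is written to the `wstringstream`.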

There also seem to have been other options in Boost prior to C++11.

answered Jul 01 '15 at 19:20

Serge Rogatch

I assume U16 and U8 refer to `std::wstring` and `std::string` respectively? What compiler options do I need to provide? – SexyBeast Jul 01 '15 at 19:22
@Cupidvogel, I've updated my answer with code example particular to your case. Please, try. – Serge Rogatch Jul 01 '15 at 19:34
So this will ensure that the `wstringstream` truly consists of long words. Will converting it to `std::wstring` through the `str()` method keep it safe? – SexyBeast Jul 01 '15 at 19:39
What do you mean by "keeping it safe"? I expect that `str()` method of `wstringstream` returns a `wstring` containing what was in the `wstringstream`. – Serge Rogatch Jul 01 '15 at 19:43
Yep, yep, I mean that only. Gonna try it now. – SexyBeast Jul 01 '15 at 19:45
Works like a charm! Thanks! One more question: I haven't had the opportunity to test it on Windows yet. Will this work on Windows as well for sure? – SexyBeast Jul 01 '15 at 21:45
@Cupidvogel, on Windows I get e.g. a std::range_error exception if the `string` is not UTF-8 but in some single-byte code page, e.g. Windows-1251 (Cyrillic). – Serge Rogatch Jul 02 '15 at 06:06
In my case, the words obtained are guaranteed to be `std::string`. Will there be a problem, then? – SexyBeast Jul 02 '15 at 06:23
If the `std::string` contains UTF-8 characters, there should be no problem. But if the string is in a single-byte code page rather than UTF-8, you can get a `std::range_error` exception. I don't know for sure, so you can ask another question if something is still unclear. – Serge Rogatch Jul 02 '15 at 07:33
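If the input encoding cannot be guaranteed, one way to deal with the `std::range_error` mentioned in these comments is to catch it and report failure to the caller. This is only a sketch: the helper name `tryUtf8ToWide` is hypothetical, and what to do with a word that fails to convert (skip it, or decode it with another code page) is left to the caller.

```cpp
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns true and fills `out` when `bytes` is valid
// UTF-8; returns false and leaves `out` untouched when from_bytes throws.
bool tryUtf8ToWide(const std::string& bytes, std::wstring& out)
{
    try {
        std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
        out = converter.from_bytes(bytes);
        return true;
    } catch (const std::range_error&) {
        // The input is not valid UTF-8, e.g. text in a single-byte code page.
        return false;
    }
}
```

For example, "Copyright \xC2\xA9 2014" (UTF-8) converts successfully, while "Copyright \xA9 2014" (the same text in Windows-1251, where '©' is the single byte 0xA9) is rejected because 0xA9 is not a valid UTF-8 lead byte.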
