
I have been through plenty of threads and posts on this topic, but somehow they are not helping me add Unicode support to my code. I have a very simple task: read a Unicode file (.txt or .csv), parse it and store the words as tokens in a 2D array using some delimiters (`,` or `"` separated words), perform some operations on them, and store the resulting strings in a text file.

The problem I am facing is that some of my older functions are apparently not compatible: either I cannot find a substitute, or the code compiles but produces no output. This code works perfectly fine with ASCII, but now I need Unicode support for it.

It would be great if I could get some sample source code. It does not need to be a whole program, just enough to show how to read a Unicode file, parse it into tokens, which functions to use for comparison, and so on.

I am pasting part of the code below. I modified a few things, so it may not compile on the first go.

The input is a text file, e.g. profiles.txt, which is in Unicode (UTF-16, containing mostly Chinese and Korean words).


// standard headers
#include <algorithm>
#include <cctype>
#include <cwchar>
#include <fstream>
#include <string>
using namespace std;


const int MAX_CHARS_PER_LINE = 4072;  
const int MAX_TOKENS_PER_LINE = 1;      
const wchar_t* const DELIMITER = L"\"";

class IntegrityCheck
{
    public:
        std::wstring Profile_Container[5000][4];
        void Profile_PRD_Parser();
};

 void IntegrityCheck::Profile_PRD_Parser()
{

std::wstring skip (L".exe");
std::wstring databoxtemp[1][1];
int a=-1;

// create a file-reading object
wifstream fin("profiles.txt");  // open the input file
wofstream fout("out.txt");      // this dumps the parsing output

// read each line of the file
while (!fin.eof())
{
    // read an entire line into memory
    wchar_t buf[MAX_CHARS_PER_LINE];

    fin.getline(buf, MAX_CHARS_PER_LINE);

    // parse the line into blank-delimited tokens
    int n = 0; // a for-loop index

    // array to store memory addresses of the tokens in buf
    const wchar_t* token[MAX_TOKENS_PER_LINE] = {}; // initialize to 0

    // parse the line
    token[0] = wcstok(buf, DELIMITER); // first token

    if (token[0]) // zero if line is blank
    {

        for (n = 0; n < MAX_TOKENS_PER_LINE; n++)   // setting n=0 as we want to ignore the first token
        {
            token[n] = wcstok(0, DELIMITER); // subsequent tokens

            if (!token[n]) break; // no more tokens

            std::wstring str2 =token[n];

            std::size_t found = str2.find(skip);  // substring comparison against L".exe"

            if (found != std::wstring::npos)   // if the token contains ".exe", write the entry to fout on a new line
            {
                a++;
                Profile_Container[a][0]=token[n];
                std::transform(Profile_Container[a][2].begin(), Profile_Container[a][2].end(), Profile_Container[a][2].begin(), ::tolower);  //convert all data to lower 

                fout<<Profile_Container[a][0]<<"\t"<<Profile_Container[a][1]<<"\t"<<Profile_Container[a][2]<<"\n"; //write to file
            }

        }
    }

}

fout.close();
fin.close();
}

int main()
{
IntegrityCheck p1;
p1.Profile_PRD_Parser();
}     
NxC
  • There's a typo, the word is spelled "Integrity", not "Intigrity". – Frerich Raabe Dec 02 '13 at 09:25
  • If you already use `using namespace std;` then there's no reason to also write `using std::cout;` and so on. You're already using the whole std namespace. – Constantin Dec 02 '13 at 09:28
  • Just remove the `using namespace std` line. It does not "add all std headers". I would not recommend using it even if you knew what it does, but that comment betrays that you don't, so I have to make an even stronger recommendation to not use it. – R. Martinho Fernandes Dec 02 '13 at 09:39
  • 1
    First thing is to remove **every** mention of `char`. Don't cast to char when calling getline, use wcstok not strtok. – john Dec 02 '13 at 09:41
  • "now i need unicode support for it." is not a good description of a problem. What do you want to do with the data? How do you expect the input to be encoded? What platform is this? (`wsomething` does not magically make things "support Unicode") – R. Martinho Fernandes Dec 02 '13 at 09:43
  • @john why would that be? (hint: `char` can be used just fine to "support Unicode"; it all depends on what "support Unicode" means) – R. Martinho Fernandes Dec 02 '13 at 09:44
  • @R.MartinhoFernandes I am assuming that his data is UTF-16 (or similar). If that's not true then he has more work to do. – john Dec 02 '13 at 09:45
  • @john "Don't cast to char when calling getline" -- I was getting an error saying I needed to cast to char and couldn't just use buf, so I did that for getline. I understand I have a lot of mistakes in the code regarding Unicode, as I am trying it for the first time. – NxC Dec 02 '13 at 09:48
  • @john is right, although not in **assuming** the data is UTF-16. One can also assume the data is UTF-8, or Unicode-16BE or LE. Or, indeed, a host of other encodings -- all "Unicode". The OP **must** clear this up first. – Jongware Dec 02 '13 at 09:50
  • @R. Martinho Fernandes ""now i need unicode support for it." is not a good description of a problem. What do you want to do with the data? How do you expect the input to be encoded? What platform is this?" ---- I have a text/csv file that is encoded in Unicode, and I need to parse it, do some operations, and store it. I cannot convert it, as that would lose some data; the data is mostly Chinese or Korean characters. The platform is Windows. – NxC Dec 02 '13 at 09:52
  • @Jongware I agree, but the phrase 'Unicode file' made me think that he has 16-bit chars. Plus all the effort made so far in converting to use wide chars. – john Dec 02 '13 at 09:52
  • @NeileshC 'encoded in Unicode' is a meaningless phrase. Unicode can be encoded in UTF-8, UTF-16 etc. etc. Unicode is a *character set* not an *encoding*. Unicode can be encoded in multiple ways. I don't doubt your file is Unicode, but how it is encoded is not clear. However given that your platform is Windows UTF-16 does seem the most likely (as I suspected). – john Dec 02 '13 at 09:53
  • @john yes, it's UTF-16. I accept my Unicode knowledge base is very small. – NxC Dec 02 '13 at 09:56

2 Answers


Looking quickly over your code, the only changes I see are:

const wchar_t* const DELIMITER = L"\"";

fin.getline(buf, MAX_CHARS_PER_LINE);

token[0] = wcstok(buf, DELIMITER);

std::transform(Profile_Container[a][2].begin(), Profile_Container[a][2].end(), Profile_Container[a][2].begin(), ::towlower); 

Not sure that towlower will be able to convert every Unicode character to lower case, but if your text is Chinese and Korean I guess that's not so much of an issue.

EDIT

The following is necessary on Windows with Visual Studio 2010

#include <codecvt>
#include <locale>

std::wifstream fin("profiles.txt", std::ios_base::binary);  // open the file
fin.imbue(std::locale(fin.getloc(),
   new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));

This worked for me with a file encoded in UTF-16 'big endian' (but not in little endian).

The only problem with your current code is the file reading (and maybe the writing; I haven't looked at that). Once you can get your characters into strings from the file, it should be OK.

If the above doesn't work for you then I'm not sure. This page has the gory details.

john
  • Thanks John for being more specific. I will make the changes, see how it goes, and update the thread. – NxC Dec 02 '13 at 16:42
  • I made corrections in the code and it compiles, but I am missing something. getline gets the line from the Unicode file, and then I try to break it into tokens (using delimiters), but getline reads everything as binary, so it fails to break the buffer into tokens; the comparison also fails and I get blank output. Do I need to convert it back to ASCII? But then it will lose data, right? So how should I approach this? I wrote all the logic with simple ASCII strings in mind and now that is making things difficult. Any suggestions to make it work are more than welcome. – NxC Dec 03 '13 at 17:47
  • Or do you suggest an alternate way to do this? I searched but could not find a related article or sample code on Unicode text file parsing. – NxC Dec 03 '13 at 18:12
  • I am really stuck with this Unicode now; can someone please help me with it? – NxC Dec 03 '13 at 23:47
  • @NeileshC I tried your code and was surprised that it didn't work (for me). The problem is that just using wchar_t is not enough to tell the compiler that your file is UTF-16. There doesn't seem to be any completely platform independent way to do this so the way to proceed depends on your compiler etc. I've updated the answer above with some code that worked for me. – john Dec 04 '13 at 08:06
  • Thanks John. It looks like codecvt is not supported on Visual Studio 2008, so I need to get VS2010; I will start downloading it from my student Microsoft account. Surprisingly, Code::Blocks (MinGW) also does not support codecvt or have a similar library. – NxC Dec 04 '13 at 22:14
  • I got it working. With the code below I read only character by character, and the output is in hexadecimal. How do I put it into a string or write it back to a file? I tried pushing the characters into a buffer or using wofstream to write, but when I write I get some garbage values. `for(wchar_t c; fin.get(c); ) std::cout << std::showbase << std::hex << c << '\n';` – NxC Dec 04 '13 at 23:48
  • OK, I was able to scan the file and write it back to another Unicode file, but now when I write the file I see a space added between every character. How do I remove it? I also need to get this back into a string so I can use it in my code for further calculation. At last I am seeing some values flowing through. Thanks @john. `wchar_t buf[MAX_CHARS_PER_LINE]; fin.imbue(std::locale(fin.getloc(),new std::codecvt_utf16)); fout.imbue(std::locale(fin.getloc(),new std::codecvt_utf16)); for(wchar_t c; fin.get(c); ) fout<` – NxC Dec 05 '13 at 00:31
  • Oh yes, I could omit the spaces while writing back the file by opening it in binary mode. I am doing all of this on a trial and error basis; it's so confusing. But now the problem is I need these values in a string/buffer so that I can carry out my further operations. Any suggestions on how I can store the line in a single buffer and then perform regular string comparison operations? (I am just hoping wstring does not screw up again and force me to search for alternatives to the new functions.) – NxC Dec 05 '13 at 02:01
  • @NeileshC Reading wide chars one at a time and adding to a wstring or wchar_t[] should work. – john Dec 05 '13 at 06:32
  • I did that, but then wcstok isn't working on it; I keep getting either a blank file or an exception from the library. `wchar_t wcs[10000]; int i=0; for(wchar_t c; fin.get(c); ) {if(i<10000){wcs[i]=c;i++;}} const wchar_t* token[MAX_TOKENS_PER_LINE] = {}; token[0] = wcstok(wcs,DELIMITER); // first token int n=0; if (token[0]) {for (n = 0; n < MAX_TOKENS_PER_LINE; n++) { token[n] = wcstok(0, DELIMITER); fout<` – NxC Dec 05 '13 at 06:44
  • With the current code I can get the entire file into one wstring, but wcstok isn't doing well here; somehow it's not able to break the Unicode line. – NxC Dec 05 '13 at 06:46
  • @NeileshC wcstok worked for me. I wonder if you have a byte ordering issue (aka an endianness issue). Are you sure the bytes in your wchar_t are in the correct order? What does the wstring look like in the debugger? – john Dec 05 '13 at 08:05
  • How do I determine the endianness? By the way, if I output the wstring to a text file I get the correct output (i.e. the same readable string). Would you mind pasting your exact code? I can then tell you the exact errors. Also, for a quick check on your side, you can create a text file, save it with Unicode encoding, and pass it in with some text. – NxC Dec 05 '13 at 08:15
  • I am really stuck at this first step of tokenizing; I have a long way to go after this and I am running out of time. If I can get the tokens correctly, the rest will be nothing compared to this. – NxC Dec 05 '13 at 08:16
  • I tried again, replacing my code with what you put in the answer; it does not work for me. E.g. fin.getline(buf,MAX) gives errors: no overloaded operator, conversion error between wstring and wchar_t, etc. Can you post your current running code? – NxC Dec 05 '13 at 08:54
  • OK, I could compile and run it, but I see the issue that the delimiter isn't working: token[0] has the same value as the buffer, the code does not reach the string comparison because it breaks at "if (!token[n]) break; // no more tokens", and if I try to write the value of any token inside the loop, it throws an access violation exception. – NxC Dec 05 '13 at 09:07
  • I see that ---token[n] = wcstok(0, DELIMITER); // subsequent tokens--- screws up the whole session; after this step token[n] becomes a bad pointer and the debugger says the expression cannot be evaluated. Just before this step the debugger shows all Chinese or unreadable characters, the same as in the buffer, but if we print them they are readable. @john any clue here? – NxC Dec 05 '13 at 09:16
  • @NeileshC Sorry, I don't have my code available right now. It does sound like endianness: if the data from the file has the wrong endianness, then when you do wcstok you won't find anything. It's easy to determine endianness; remember chars are just numbers, so the correct number for a double quote is 34 (0022 in hex). If you have the wrong endianness it will be 8704 (2200 in hex). So when you read a double quote from the file, check its *numeric* value and see if it's correct. – john Dec 05 '13 at 09:40
  • Remember what you are trying to do is conceptually simple. It's just numbers, and Unicode is just bigger numbers than ASCII. But don't assume. Don't assume that reading a string from a file will just work, don't assume that if you can write out a string it has worked (equally don't assume that just because you can't it hasn't worked). The problem is that when you go to Unicode there are multiple ways of interpreting character data, but underneath it all, it's just numbers. So check the numbers. – john Dec 05 '13 at 09:45
  • So when I check the values with a breakpoint set, I see DELIMITER (") has the value 34, which means it's not an endianness problem. I searched a bit for the bad pointer error; it seems I am missing something or something is not declared correctly, so if everything is running fine for you, it might help to compare the code and see what I did wrong. – NxC Dec 05 '13 at 09:48
  • @NeileshC **NO!!**, what did I say about assumptions? DELIMITER is a string defined by C++ code. So of course it has the correct endianess. It's the endianess of characters read from your file that you must check. – john Dec 05 '13 at 09:58
  • @NeileshC Unfortunately comparing code is not going to help (in any case it's the same as the code above). The problem is your file. I had my code working on my file (which was UTF-16 big endian, as I said). But what is your file? You said it's Unicode and that it's UTF-16, but you haven't said its endianness. And there are further issues, like whether it starts with a BOM (byte order mark) or not. These are the important questions. – john Dec 05 '13 at 10:12
  • @NeileshC One thing I'd highly recommend is that you get yourself a text editor which has some understanding of Unicode. This will help you understand what you actually have in your input and output files. As you've probably figured out by now, the fact that they 'look' right (or wrong) doesn't tell you much. I use Notepad++ (which is free). It will certainly tell you what endianness your file is. – john Dec 05 '13 at 10:17
  • Thanks a lot John, I finally figured it out: it's little endian, and that was the root cause of all the problems. I just fixed it in the file reading object and that fixed everything. You were right about the file type, but I was not sure how to check it. I must thank you, John, for helping and giving constant feedback. For those in the future who need to know what fixed it, here it is: fout.imbue(std::locale(fin.getloc(), new std::codecvt_utf16)); – NxC Dec 05 '13 at 10:25
  • @NeileshC No problem. For future reference it would be worth promoting that comment to an answer. You are allowed to answer your own question. – john Dec 05 '13 at 10:27

Final code that compiled and ran:

fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff, std::codecvt_mode(std::little_endian | std::consume_header)>));
fout.imbue(std::locale(fout.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff, std::codecvt_mode(std::little_endian | std::consume_header)>));


while (!fin.eof())
{
    wchar_t buf[MAX_CHARS_PER_LINE];

    fin.getline(buf, MAX_CHARS_PER_LINE);

    wchar_t* token[MAX_TOKENS_PER_LINE] = {};
    token[0] = wcstok(buf, DELIMITER);

    if (token[0]) // zero if line is blank
    {
        for (int n = 0; n < MAX_TOKENS_PER_LINE; n++)   // setting n=0 as we want to ignore the first token
        {
            token[n] = wcstok(0, DELIMITER); // subsequent tokens

            if (!token[n]) break; // no more tokens

            std::wstring str2 = token[n];

            std::size_t found = str2.find(skip);  // substring comparison against L".exe"

            if (found != std::wstring::npos)   // if the token contains ".exe", write the entry to fout
            {
                a++;
                Profile_Container[a][0] = token[n];
                fout << Profile_Container[a][0];
            }
        }
    }
}

NxC