
I have a text file which was created using a Microsoft reporting tool. The file begins with the byte order mark 0xFF 0xFE and then contains ASCII characters with null bytes between them (i.e. "F.i.e.l.d.1.", where each "." is a null byte). I can use iconv to convert this to UTF-8, with UCS-2LE as the input format and UTF-8 as the output format... it works great.

My problem is that I want to read lines from the UCS-2LE file into strings, parse out the field values, and then write them out to an ASCII text file (i.e. Field1 Field2). I have tried both the string- and wstring-based versions of getline; while they read the line from the file, functions like substr(start, length) interpret the positions as 8-bit values, so the start and length values are off.

How do I read the UCS-2LE data into a C++ string and extract the data values? I have looked at Boost and ICU as well as numerous Google searches but have not found anything that works. What am I missing here? Please help!

My example code looks like this:

wifstream srcFile;
srcFile.open(argv[1], ios_base::in | ios_base::binary);
..
..
wstring  srcBuf;
..
..
while( getline(srcFile, srcBuf) )
{
    wstring field1;
    field1 = srcBuf.substr(12, 12);
    ...
    ...
}

So, if, for example, srcBuf contains "W.e. t.h.i.n.k. i.n. g.e.n.e.r.a.l.i.t.i.e.s." then the substr() above returns ".k. i.n. g.e" instead of "g.e.n.e.r.a.l.i.t.i.e.s.".

What I want is to read in the string and process it without having to worry about the multi-byte representation. Does anybody have an example of using Boost (or something else) to read these strings from the file and convert them to a fixed-width representation for internal use?

BTW, I am on a Mac using Eclipse and gcc. Is it possible my STL does not understand wide-character strings?

Thanks!

Dr1Ku
Cryptik

2 Answers


Having spent some good hours tackling this question, here are my conclusions:

  • Reading a UTF-16 (or UCS-2LE) file is apparently manageable in C++11; see How do I write a UTF-8 encoded string to a file in Windows, in C++

  • Since C++11 provides codecvt_utf16 in the standard library (the <codecvt> header), one can just use it directly, without needing boost::locale (see bullet below for eventual code samples)

  • However, on older compilers (e.g. MSVC 2008), you can use locale and a custom codecvt facet/"recipe", as very nicely exemplified in this answer to Writing UTF16 to file in binary mode

  • Alternatively, one can also try this method of reading, though it did not work in my case: the output was missing lines, which were replaced by garbage characters.

I wasn't able to get this done with my pre-C++11 compiler and had to resort to scripting the task in Ruby and spawning a process (it's only used in a test, so I think that kind of complication is acceptable there).

Hope this spares others some time, happy to help.

Dr1Ku

substr works fine for me on Linux with g++ 4.3.3. The program

#include <string>
#include <iostream>

using namespace std;

int main()
{
  wstring s1 = L"Hello, world";
  wstring s2 = s1.substr(3,5);
  wcout << s2 << endl;
}

prints "lo, w" as it should.

However, the file reading probably does something different from what you expect. It converts the file from the locale's encoding to wchar_t, which causes each byte to become its own wchar_t. I don't think the standard library supports reading UTF-16 into wchar_t.

Martin v. Löwis
  • Thanks for the reply. I see the same behavior. As you say, I don't think reading UTF-16 into wchar_t is supported. I used iconv to convert the file to UTF-8 and it solved my problem. – Cryptik Aug 22 '09 at 20:12
  • Although I'm probably addressing ghosts here, @Cryptik should mark his question as solved :) – Dr1Ku Mar 08 '13 at 12:23