In a Linux environment, I have a piece of code for reading Unicode files, similar to the one shown below.

However, special characters (like the Danish letters æ, ø and å) are not handled correctly. For the line 'abcæøåabc' the output is simply 'abc'. Using a debugger I can see that the contents of wline are also only a\000b\000c\000.

#include <fstream>
#include <string>

std::wifstream wif("myfile.txt");
if (wif.is_open())
{
    //set proper position compared to byteorder
    wif.seekg(2, std::ios::beg);
    std::wstring wline;

    while (wif.good())
    {
        std::getline(wif, wline);
        if (!wif.eof())
        {
            std::wstring convert;
            for (auto c : wline)
            {
                if (c != '\0')
                    convert += c;
            }
        }
    }
}
wif.close();

Can anyone tell me how I can get it to read the whole line?

Thanks and regards

Natan Streppel
Jon Helt-Hansen
  • Do a hex dump of the file, what does it contain? – Mark Ransom Sep 30 '14 at 15:34
  • I got the following hex dump `0000000: fffe 6100 6200 6300 e600 f800 e500 6100 ..a.b.c.......a. 0000010: 6200 6300 0d00 0a00 b.c.....` Even though the characters themselves are not shown in the output above, at least the hex values seem correct: `fffe` for the UTF-16LE byte order mark, `e600` for æ, `f800` for ø, and `e500` for å. – Jon Helt-Hansen Oct 01 '14 at 06:29
  • You need to use `imbue` with a UTF-16LE locale to indicate the format of the file. I tried to find a relevant guide for you but couldn't. – Mark Ransom Oct 01 '14 at 13:14
  • @MarkRansom: http://en.cppreference.com/w/cpp/locale/codecvt_utf16 shows an example. – Remy Lebeau Oct 02 '14 at 01:43
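The reasoning in the comments above (the leading `fffe` bytes mark the file as UTF-16LE) can be sketched as a small check. This is only an illustration: `Bom` and `detect_bom` are hypothetical names that do not appear in the question, and the sketch does not distinguish UTF-32LE, whose byte order mark also starts with `FF FE`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper types, not part of the original question.
enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Classify the first raw bytes of a file by their byte order mark.
// Assumption: `bytes` holds the file's leading bytes, read in binary mode.
Bom detect_bom(const std::string& bytes)
{
    auto at = [&](std::size_t i) { return static_cast<unsigned char>(bytes[i]); };

    if (bytes.size() >= 3 && at(0) == 0xEF && at(1) == 0xBB && at(2) == 0xBF)
        return Bom::Utf8;
    if (bytes.size() >= 2 && at(0) == 0xFF && at(1) == 0xFE)
        return Bom::Utf16LE; // the case shown in the hex dump
    if (bytes.size() >= 2 && at(0) == 0xFE && at(1) == 0xFF)
        return Bom::Utf16BE;
    return Bom::None;
}
```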

1 Answer

You have to use the `imbue()` method to tell `wifstream` that the file is encoded as UTF-16, and let it consume the BOM for you. You do not have to `seekg()` past the BOM manually. For example:

#include <fstream>
#include <string>
#include <locale>
#include <codecvt>

// open as a byte stream
std::wifstream wif("myfile.txt", std::ios::binary);
if (wif.is_open())
{
    // apply BOM-sensitive UTF-16 facet
    wif.imbue(std::locale(wif.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));

    std::wstring wline;
    while (std::getline(wif, wline))
    {
        std::wstring convert;
        for (auto c : wline)
        {
            if (c != L'\0')
                convert += c;
        }
    }

    wif.close();
}
Remy Lebeau
  • Thanks for the feedback. Unfortunately codecvt_utf16 is not available on my system. Instead of reading the file line by line, I now use fread(). It is a little more cumbersome and not as neat, but it works. Your solution would be the preferred one, but as I said, it doesn't seem possible for me. – Jon Helt-Hansen Oct 03 '14 at 08:09
  • Which compiler are you using? You are using `auto`, which is a C++11 feature, and [`codecvt_utf16`](http://en.cppreference.com/w/cpp/locale/codecvt_utf16) is part of C++11. Did you add `#include <codecvt>` to your code? – Remy Lebeau Oct 03 '14 at 16:33
  • I am using gcc4.9 on Ubuntu 14.04. I am not able to include `<codecvt>` as it is not available. I only have access to `codecvt`, `codecvt_base` and `codecvt_byname`. – Jon Helt-Hansen Oct 06 '14 at 09:22
  • Why do you open the file in the binary mode? – hkBattousai Apr 01 '16 at 09:22
  • For future readers, note that [`std::codecvt_utf16`](https://en.cppreference.com/w/cpp/locale/codecvt_utf16) has been deprecated as of the C++17 standard. It may be removed from a future standard. – Some programmer dude Apr 29 '19 at 18:11
  • @Someprogrammerdude if it's been deprecated, then it must be replaced with something better. If you know that better way, it would make a great additional answer to this question. – Mark Ransom May 16 '21 at 00:45
  • @MarkRansom the classes in `<codecvt>` are indeed deprecated as of C++17, but there are no standard replacements for them at this time. I think the standards committee is pushing people to use external Unicode libraries. – Remy Lebeau May 16 '21 at 00:57
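Since `<codecvt>` may be unavailable (as on the asker's toolchain) or deprecated, one alternative is to read the raw bytes and decode UTF-16LE by hand. The following is a minimal sketch under stated assumptions, not the asker's actual `fread()` solution: `decode_utf16le` is a hypothetical helper, the input is assumed to be the whole file read in binary mode, and the output is UTF-32 (`std::u32string`) so it does not depend on the platform's `wchar_t` width.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: decode UTF-16LE bytes into UTF-32 code points.
// A leading BOM (FF FE) is skipped, unpaired surrogates become U+FFFD,
// and a trailing odd byte is ignored.
std::u32string decode_utf16le(const std::string& bytes)
{
    // Combine two bytes (low byte first) into one 16-bit code unit.
    auto unit = [&](std::size_t i) -> std::uint32_t {
        return static_cast<unsigned char>(bytes[i]) |
               (static_cast<std::uint32_t>(static_cast<unsigned char>(bytes[i + 1])) << 8);
    };

    std::u32string out;
    std::size_t i = 0;
    if (bytes.size() >= 2 && unit(0) == 0xFEFF)
        i = 2; // skip the byte order mark

    for (; i + 1 < bytes.size(); i += 2)
    {
        std::uint32_t c = unit(i);
        if (c >= 0xD800 && c <= 0xDBFF)
        {
            // High surrogate: combine with the following low surrogate, if any.
            if (i + 3 < bytes.size())
            {
                std::uint32_t lo = unit(i + 2);
                if (lo >= 0xDC00 && lo <= 0xDFFF)
                {
                    out += static_cast<char32_t>(0x10000 + ((c - 0xD800) << 10) + (lo - 0xDC00));
                    i += 2; // the low surrogate has been consumed as well
                    continue;
                }
            }
            c = 0xFFFD; // unpaired high surrogate
        }
        else if (c >= 0xDC00 && c <= 0xDFFF)
        {
            c = 0xFFFD; // stray low surrogate
        }
        out += static_cast<char32_t>(c);
    }
    return out;
}
```

To use it, the file contents could be read with a plain `std::ifstream` opened with `std::ios::binary` into a `std::string` and passed to the function; lines can then be split on U+000A afterwards. An external Unicode library would be the more robust choice for production code.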