
My goal is to convert external input sources to a common, UTF-8 internal encoding, since it is compatible with many libraries I use (such as RE2) and is compact. Since I do not need to do string slicing except with pure ASCII, UTF-8 is an ideal format for me. Now, one of the external input formats I should be able to decode is UTF-16.

In order to test UTF-16 (either big-endian or little-endian) reading in C++, I converted a test UTF-8 file to both UTF-16 LE and UTF-16 BE. The file is simple gibberish in CSV format, mixing many different source languages (English, French, Japanese, Korean, Arabic, Thai) plus emoji, to create a reasonably complex file:

"This","佐藤 幹夫","Mêmes","친구"
"ภควา"," كيبورد للكتابة بالعربي","ウゥキュ,",""

UTF-8 Example

Now, parsing this file encoded in UTF-8 with the following code produces the expected output (I understand this example is mostly artificial, since my system encoding is UTF-8, and so no actual conversion to wide characters and then back to bytes is required):

#include <sstream>
#include <locale>
#include <iostream>
#include <fstream>
#include <codecvt>

std::wstring readFile(const char* filename)
{
    std::wifstream wif(filename, std::ios::binary);
    wif.imbue(std::locale(wif.getloc(), new std::codecvt_utf8<wchar_t, 0x10ffff>));
    std::wstringstream wss;
    wss << wif.rdbuf();
    return wss.str();
}


int main()
{
    std::wstring read = readFile("utf-8.csv");
    std::cout << read.size() << std::endl;

    using convert_type = std::codecvt_utf8<wchar_t>;
    std::wstring_convert<convert_type, wchar_t> converter;
    std::string converted_str = converter.to_bytes( read );
    std::cout << converted_str;

    return 0;
}

When the file is compiled and run (on Linux, so the system encoding is UTF-8), I get the following output:

$ g++ utf8.cpp -o utf8 -std=c++14
$ ./utf8
73
"This","佐藤 幹夫","Mêmes","친구"
"ภควา"," كيبورد للكتابة بالعربي","ウゥキュ,",""

UTF-16 Example

However, when I attempt a similar example with UTF-16, I get truncated output, even though the file loads properly in text editors, Python, etc.

#include <fstream>
#include <sstream>
#include <iostream>
#include <locale>
#include <codecvt>
#include <string>


std::wstring readFile(const char* filename)
{
    std::wifstream wif(filename, std::ios::binary);
    wif.imbue(std::locale(wif.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff>));
    std::wstringstream wss;
    wss << wif.rdbuf();
    return wss.str();
}


int main()
{
    std::wstring read = readFile("utf-16.csv");
    std::cout << read.size() << std::endl;

    using convert_type = std::codecvt_utf8<wchar_t>;
    std::wstring_convert<convert_type, wchar_t> converter;
    std::string converted_str = converter.to_bytes( read );
    std::cout << converted_str;

    return 0;
}

When the file is compiled and run (on Linux, so the system encoding is UTF-8), I get the following output for the little-endian format:

$ g++ utf16.cpp -o utf16 -std=c++14
$ ./utf16
19
"This","PO

For the big-endian format, I get the following:

$ g++ utf16.cpp -o utf16 -std=c++14
$ ./utf16
19
"This","OP

Interestingly, the CJK characters should be part of the Basic Multilingual Plane, but they are clearly not converted properly, and the file is truncated early. The same issue occurs with a line-by-line approach.
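
As a sanity check independent of the standard-library facets, I also decoded the UTF-16LE bytes by hand. The helper below is only a rough sketch (little-endian only, no BOM handling, no error handling, and `utf16le_to_utf8` is my own name, not a library function), but it handles surrogate pairs and converts the test data correctly:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Decode UTF-16LE bytes into a UTF-8 string. Sketch only: assumes
// well-formed input with no BOM and performs no error handling.
std::string utf16le_to_utf8(const std::vector<uint8_t>& bytes)
{
    std::string out;
    for (std::size_t i = 0; i + 1 < bytes.size(); ) {
        // Read one little-endian code unit.
        uint32_t cp = bytes[i] | (bytes[i + 1] << 8);
        i += 2;
        // Combine a surrogate pair into a full code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < bytes.size()) {
            uint32_t low = bytes[i] | (bytes[i + 1] << 8);
            i += 2;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
        }
        // Re-encode the code point as UTF-8 (1 to 4 bytes).
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Feeding it the raw bytes of the little-endian file yields the full, expected text, which is another hint that the data is fine and the truncation happens inside the facet.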

Other Resources

I checked the following resources before asking, most notably this answer, as well as this answer. None of their solutions have proven fruitful for me.

Other Specifics

LANG = en_US.UTF-8
gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.2)

Any other details and I will be happy to provide them. Thank you.

EDITS

Adrian mentioned in the comments I should provide a hexdump, which is shown for "utf-16le", the little-endian UTF-16-encoded file:

0000000 0022 0054 0068 0069 0073 0022 002c 0022
0000010 4f50 85e4 0020 5e79 592b 0022 002c 0022
0000020 004d 00ea 006d 0065 0073 0022 002c 0022
0000030 ce5c ad6c 0022 000a 0022 0e20 0e04 0e27
0000040 0e32 0022 002c 0022 0020 0643 064a 0628
0000050 0648 0631 062f 0020 0644 0644 0643 062a
0000060 0627 0628 0629 0020 0628 0627 0644 0639
0000070 0631 0628 064a 0022 002c 0022 30a6 30a5
0000080 30ad 30e5 002c 0022 002c 0022 d83d dec2
0000090 0022 000a                              
0000094
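
As the dump shows, there is no BOM at the start of the file, so a reader has to be told the byte order or guess it. For completeness, this is the kind of heuristic I would use to guess (a sketch; `guess_utf16_endian` and the thresholds are my own invention, not a library API): for mostly-Latin text, the zero high byte of each code unit lands on odd offsets in little-endian data and on even offsets in big-endian data.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Utf16Endian { Little, Big, Unknown };

// Guess the byte order of UTF-16 data from a prefix of its bytes.
// Sketch only: the 2x threshold is arbitrary, and the heuristic breaks
// down for text containing few ASCII characters.
Utf16Endian guess_utf16_endian(const std::vector<uint8_t>& bytes)
{
    // An explicit BOM settles the question immediately.
    if (bytes.size() >= 2) {
        if (bytes[0] == 0xFF && bytes[1] == 0xFE) return Utf16Endian::Little;
        if (bytes[0] == 0xFE && bytes[1] == 0xFF) return Utf16Endian::Big;
    }
    // Count zero bytes at even and odd offsets.
    std::size_t even_zeros = 0, odd_zeros = 0;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        if (bytes[i] == 0) {
            (i % 2 == 0 ? even_zeros : odd_zeros)++;
        }
    }
    if (odd_zeros > 2 * even_zeros) return Utf16Endian::Little;
    if (even_zeros > 2 * odd_zeros) return Utf16Endian::Big;
    return Utf16Endian::Unknown;
}
```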

qexyn mentioned removing the std::ios::binary flag, which I attempted but changed nothing.

Finally, I attempted using iconv to see if these were valid files, using both the command-line utility and the C API.

$ iconv -f="UTF-16BE" -t="UTF-8" utf-16be.csv
"This","佐藤 幹夫","Mêmes","친구"
"ภควา"," كيبورد للكتابة بالعربي","ウゥキュ,",""

Clearly, iconv has no issue with the source files. This is leading me to use iconv, since it's cross-platform, easy to use, and well-tested, but if anyone has an answer using the standard library, I will gladly accept it.
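
For reference, the C API behind the command-line utility is small. A minimal one-shot wrapper looks roughly like this (a sketch: `iconv_convert` is my own helper name, the output buffer bound is a guess, and real code should loop on E2BIG and inspect errno):

```cpp
#include <iconv.h>

#include <stdexcept>
#include <string>

// Convert `in` from one charset to another in a single iconv() call.
// Sketch only: assumes the converted text fits in one buffer and does
// not check errno for partial or invalid sequences.
std::string iconv_convert(const std::string& in,
                          const char* from, const char* to)
{
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t) -1) {
        throw std::runtime_error("iconv_open(): unsupported conversion");
    }

    // 4 bytes per input byte is a generous upper bound for UTF-8 output.
    std::string out(in.size() * 4 + 4, '\0');
    char* src = const_cast<char*>(in.data());
    size_t src_left = in.size();
    char* dst = &out[0];
    size_t dst_left = out.size();

    iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);

    // Trim the unused tail of the output buffer.
    out.resize(out.size() - dst_left);
    return out;
}
```

Called as `iconv_convert(raw_bytes, "UTF-16LE", "UTF-8")`, it round-trips the sample rows with no loss.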

Alex Huszagh
    "...the file loading properly in text editors," Not an answer to your question, but have you considered using one of those editors to export your file straight in UTF8? If it's an one-off, it may work. If it's not, then brace yourself for a lot of pain if no warranties that CSV files will always come in a sane encoding. I had such an experience, with inputs collected by copy/paste in Excel files, dumped as CSV on many computers with different locales and merged by cat-ting CSV-es. The only solution to this prob was to sanitize the process of collecting the input. – Adrian Colomitchi Sep 12 '16 at 00:26
    @AdrianColomitchi, it's a good concern and **typically**, the files are standardized in... well, nothing. There's no actual guarantees of anything. I'm in a weird biology field where there was absolutely no standardization for the longest time, and new formats (such as XML) have now guaranteed UTF-8 but due to various reasons (needing nearly 2-3x as much space to store the same data and very slow parsing speeds), they really aren't used that often. So the answer is, I can warn others and use UTF-8 only for the common formats, or try to guess among about 5 encodings (including UTF-16). – Alex Huszagh Sep 12 '16 at 00:34
  • @AlexanderHuszagh: "*needing nearly 2-3x as much space to store the same data and very slow parsing speeds*" Have you not heard of zipped XML and the RapidXML parser? Or failing that, using JSON for data? – Nicol Bolas Sep 12 '16 at 00:45
  • @NicolBolas, I don't control the input formats directly (well, I control my own internal formats, but for standards compliance, I should be able to write to those formats). It's a bit of a weird story, but basically, there were a lot of various text-based formats with no standardization among keywords, formats, or other grammar. These were superseded by standardized XML-based formats, but the way they were implemented by the committee means they're slow and large. They also use base64/zlib compression, so they're not readable either. Together, it means I cannot solely support the new formats. – Alex Huszagh Sep 12 '16 at 00:48
  • But yes, I actually do support compressed files using file stream wrappers, which despite the zlib-compressed data (the metadata is plain text), has a very good compression ratio. There's a lot of bloat in those files. Anyway, just some background for why I am supporting a large number of heterogeneous text files. Sorry, I wish I could better control the input format. – Alex Huszagh Sep 12 '16 at 00:51
    @AlexanderHuszagh The problem may be intractable in spite of your best effort. " I can warn others and use UTF-8 only for the common formats, or try to guess among about 5 encodings (including UTF-16)" Do both. Warn your stakeholders that your best effort may be not good enough (not because of you). Raise the flag to people with authority to fix the process of data collection - be a part of the process otherwise there's no warranty that the "fixed" process will solve the problems. Failing that... I hope you enjoy having nightmares. – Adrian Colomitchi Sep 12 '16 at 01:02
  • @AlexanderHuszagh - use a hexdumper and look to your input source around the places that result in truncation/loss of coherence etc. There may be extraneous BOM-s resulted from cat-ing files, there may be sudden changes in encoding. Without diagnosing your input, looking to your code may well be like looking for your lost keys under a street-light only because everywhere else is dark (useful at first, but only until you can see the keys are not there) – Adrian Colomitchi Sep 12 '16 at 01:07
  • FWIW, take note that on Linux the wchar_t type is 4 bytes wide, as opposed to Windows being 2 bytes wide. Possibly the way you are using wchar_t (4-byte char) in those UTF-16 (2-byte) converters is throwing something off. – Nicholas Smith Sep 12 '16 at 01:16
    @qexyn given that he configures the converters with codec-xes from the same std:: implementation, this shouldn't be a problem. – Adrian Colomitchi Sep 12 '16 at 01:33
    Why are you converting to utf-16 when writing to `cout`? That doesn't seem right, although it's probably unrelated to your problem. – Mark Ransom Sep 12 '16 at 04:08
  • @MarkRansom, you are correct, that makes absolutely no sense, but luckily it is unrelated to the problem. Fixed. – Alex Huszagh Sep 12 '16 at 04:12
  • Also, I tried using iconv with all the input files and have no issues. I can post a hex dump if others are interested, but using iconv with "UTF-8", "UTF-16BE", and "UTF-16LE" works for all the examples with no issues. – Alex Huszagh Sep 12 '16 at 04:14
  • Does the output change once you've fixed the problem I pointed out? – Mark Ransom Sep 12 '16 at 04:15
  • @MarkRansom, unfortunately it does not. – Alex Huszagh Sep 12 '16 at 04:15
  • Not testing right now but http://stackoverflow.com/questions/18814163/c-utf-16-to-char-conversion-linux-ubuntu?noredirect=1&lq=1 and http://stackoverflow.com/questions/10504044/correctly-reading-a-utf-16-text-file-into-a-string-without-external-libraries?noredirect=1&lq=1 indicate that `wif.imbue(std::locale(wif.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff>));` should be `wif.imbue(std::locale(wif.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));` – Adam Martin Sep 13 '16 at 03:00
  • @AdamMartin, there is no byte order mark, and according to the documentation, it's safe to leave off even with a BOM. I can test it though: http://en.cppreference.com/w/cpp/locale/codecvt_mode – Alex Huszagh Sep 13 '16 at 03:42
    Works with clang/libc++ http://coliru.stacked-crooked.com/a/50c1d34cd0f3c930 – Cubbi Sep 13 '16 at 14:50
  • @Cubbi Interesting, this seems to be a GCC-specific bug then. Confirmed their GCC version also produces the same problem, and MinGW64 also produces the same issue while MSVC does not. I'll see if this is eligible for a bug report. http://coliru.stacked-crooked.com/a/c8a431595befc4ba – Alex Huszagh Sep 13 '16 at 16:46

1 Answer


So I'm still waiting for a potential answer using the C++ standard library, but I haven't had any success, so I wrote an implementation that works with Boost and iconv (both fairly common dependencies). It consists of a header and a source file, works with all of the above situations, is fairly performant, can accept any iconv pair of encodings, and wraps a stream object to allow easy integration into existing code. As I'm fairly new to C++, I would suggest testing the code thoroughly if you choose to use it: I'm far from an expert.

encoding.hpp

#pragma once

#include <iostream>


#include <cassert>
#include <iosfwd>            // streamsize.
#include <memory>            // allocator, bad_alloc.
#include <new>
#include <string>
#include <boost/config.hpp>
#include <boost/cstdint.hpp>
#include <boost/detail/workaround.hpp>
#include <boost/iostreams/constants.hpp>
#include <boost/iostreams/detail/config/auto_link.hpp>
#include <boost/iostreams/detail/config/dyn_link.hpp>
#include <boost/iostreams/detail/config/wide_streams.hpp>
#include <boost/iostreams/detail/config/zlib.hpp>
#include <boost/iostreams/detail/ios.hpp>
#include <boost/iostreams/filter/symmetric.hpp>
#include <boost/iostreams/pipeline.hpp>
#include <boost/type_traits/is_same.hpp>
#include <boost/iostreams/filter/zlib.hpp>
#include <iconv.h>

// Must come last.
#ifdef BOOST_MSVC
#   pragma warning(push)
#   pragma warning(disable:4251 4231 4660)     // Dependencies not exported.
#endif
#include <boost/config/abi_prefix.hpp>
#undef small


namespace boost
{
namespace iostreams
{
// CONSTANTS
// ---------

extern const size_t maxUnicodeWidth;

// OBJECTS
// -------


/** @brief Parameters for input and output encodings to pass to iconv.
 */
struct encoded_params {
    std::string input;
    std::string output;

    encoded_params(const std::string &input = "UTF-8",
                   const std::string &output = "UTF-8"):
        input(input),
        output(output)
    {}
};


namespace detail
{
// DETAILS
// -------


/** @brief Base class for the character set conversion filter.
 *  Contains a core process function which converts the source
 *  encoding to the destination encoding.
 */
class BOOST_IOSTREAMS_DECL encoded_base {
public:
    typedef char char_type;
protected:
    encoded_base(const encoded_params & params = encoded_params());

    ~encoded_base();

    int convert(const char * & src_begin,
                const char * & src_end,
                char * & dest_begin,
                char * & dest_end);

    int copy(const char * & src_begin,
                const char * & src_end,
                char * & dest_begin,
                char * & dest_end);

    int process(const char * & src_begin,
                const char * & src_end,
                char * & dest_begin,
                char * & dest_end,
                int /* flushLevel */);

public:
    int total_in();
    int total_out();


private:
    iconv_t conv;
    bool differentCharset;
};


/** @brief Template implementation for the encoded writer.
 *
 *  Model of a C-style file filter for character set conversions, via
 *  iconv.
 */
template<typename Alloc = std::allocator<char> >
class encoded_writer_impl : public encoded_base {
public:
    encoded_writer_impl(const encoded_params &params = encoded_params());
    ~encoded_writer_impl();
    bool filter(const char*& src_begin, const char* src_end,
                char*& dest_begin, char* dest_end, bool flush);
    void close();
};


/** @brief Template implementation for the encoded reader.
 *
 *  Model of a C-style file filter for character set conversions, via
 *  iconv.
 */
template<typename Alloc = std::allocator<char> >
class encoded_reader_impl : public encoded_base {
public:
    encoded_reader_impl(const encoded_params &params = encoded_params());
    ~encoded_reader_impl();
    bool filter(const char*& begin_in, const char* end_in,
                char*& begin_out, char* end_out, bool flush);
    void close();
    bool eof() const
    {
        return eof_;
    }

private:
    bool eof_;
};



}   /* detail */

// FILTERS
// -------

/** @brief Model of InputFilter and OutputFilter implementing
 *  character set conversion via iconv.
 */
template<typename Alloc = std::allocator<char> >
struct basic_encoded_writer
    : symmetric_filter<detail::encoded_writer_impl<Alloc>, Alloc>
{
private:
    typedef detail::encoded_writer_impl<Alloc>         impl_type;
    typedef symmetric_filter<impl_type, Alloc>  base_type;
public:
    typedef typename base_type::char_type               char_type;
    typedef typename base_type::category                category;
    basic_encoded_writer(const encoded_params &params = encoded_params(),
                         int buffer_size = default_device_buffer_size);
    int total_in() { return this->filter().total_in(); }
};
BOOST_IOSTREAMS_PIPABLE(basic_encoded_writer, 1)

typedef basic_encoded_writer<> encoded_writer;


/** @brief Model of InputFilter and OutputFilter implementing
 *  character set conversion via iconv.
 */
template<typename Alloc = std::allocator<char> >
struct basic_encoded_reader
    : symmetric_filter<detail::encoded_reader_impl<Alloc>, Alloc>
{
private:
    typedef detail::encoded_reader_impl<Alloc>       impl_type;
    typedef symmetric_filter<impl_type, Alloc>  base_type;
public:
    typedef typename base_type::char_type               char_type;
    typedef typename base_type::category                category;
    basic_encoded_reader(const encoded_params &params = encoded_params(),
                         int buffer_size = default_device_buffer_size);
    int total_out() { return this->filter().total_out(); }
    bool eof() { return this->filter().eof(); }
};
BOOST_IOSTREAMS_PIPABLE(basic_encoded_reader, 1)

typedef basic_encoded_reader<> encoded_reader;


namespace detail
{
// IMPLEMENTATION
// --------------


/** @brief Initialize the encoded writer with the iconv parameters.
 */
template<typename Alloc>
encoded_writer_impl<Alloc>::encoded_writer_impl(const encoded_params& p):
    encoded_base(p)
{}


/** @brief Close the encoded writer.
 */
template<typename Alloc>
encoded_writer_impl<Alloc>::~encoded_writer_impl()
{}


/** @brief Implementation of the symmetric, character set encoding filter
 *  for the writer.
 */
template<typename Alloc>
bool encoded_writer_impl<Alloc>::filter
    (const char*& src_begin, const char* src_end,
     char*& dest_begin, char* dest_end, bool flush)
{
    int result = process(src_begin, src_end, dest_begin, dest_end, flush);
    return result == -1;
}


/** @brief Close the encoded writer.
 */
template<typename Alloc>
void encoded_writer_impl<Alloc>::close()
{}


/** @brief Close the encoded reader.
 */
template<typename Alloc>
encoded_reader_impl<Alloc>::~encoded_reader_impl()
{}


/** @brief Initialize the encoded reader with the iconv parameters.
 */
template<typename Alloc>
encoded_reader_impl<Alloc>::encoded_reader_impl(const encoded_params& p):
    encoded_base(p),
    eof_(false)
{}


/** @brief Implementation of the symmetric, character set encoding filter
 *  for the reader.
 */
template<typename Alloc>
bool encoded_reader_impl<Alloc>::filter
    (const char*& src_begin, const char* src_end,
    char*& dest_begin, char* dest_end, bool /* flush */)
{
    int result = process(src_begin, src_end, dest_begin, dest_end, true);
    return result;
}


/** @brief Close the encoded reader.
 */
template<typename Alloc>
void encoded_reader_impl<Alloc>::close()
{
    // cannot re-open, not a true stream
    //eof_ = false;
    //reset(false, true);
}

}   /* detail */


/** @brief Initializer for the symmetric write filter, which initializes
 *  the iconv base from the parameters and the buffer size.
 */
template<typename Alloc>
basic_encoded_writer<Alloc>::basic_encoded_writer
(const encoded_params& p, int buffer_size):
    base_type(buffer_size, p)
{}


/** @brief Initializer for the symmetric read filter, which initializes
 *  the iconv base from the parameters and the buffer size.
 */
template<typename Alloc>
basic_encoded_reader<Alloc>::basic_encoded_reader(const encoded_params &p, int buffer_size):
    base_type(buffer_size, p)
{}


}   /* iostreams */
}   /* boost */

#include <boost/config/abi_suffix.hpp> // Pops abi_suffix.hpp pragmas.
#ifdef BOOST_MSVC
    # pragma warning(pop)
#endif

encoding.cpp

#include "encoding.hpp"

#include <iconv.h>

#include <algorithm>
#include <cstring>
#include <string>


namespace boost
{
namespace iostreams
{
// CONSTANTS
// ---------

const size_t maxUnicodeWidth = 4;

namespace detail
{

// DETAILS
// -------


/** @brief Initialize the iconv converter with the source and
 *  destination encoding.
 */
encoded_base::encoded_base(const encoded_params &params)
{
    if (params.output != params.input) {
        conv = iconv_open(params.output.data(), params.input.data());
        // iconv_open returns (iconv_t) -1 if the conversion is unsupported.
        assert(conv != (iconv_t) -1);
        differentCharset = true;
    } else {
        differentCharset = false;
    }
}


/** @brief Cleanup the iconv converter.
 */
encoded_base::~encoded_base()
{
    if (differentCharset) {
        iconv_close(conv);
    }
}


/** C-style stream converter, which converts the source
 *  character array to the destination character array, calling iconv
 *  repeatedly and skipping invalid input characters.
 */
int encoded_base::convert(const char * & src_begin,
                          const char * & src_end,
                          char * & dest_begin,
                          char * & dest_end)
{
    char *end = dest_end - maxUnicodeWidth;
    size_t srclen, dstlen;
    while (src_begin < src_end && dest_begin < end) {
        srclen = src_end - src_begin;
        dstlen = dest_end - dest_begin;
        char *pIn = const_cast<char *>(src_begin);
        iconv(conv, &pIn, &srclen, &dest_begin, &dstlen);
        if (src_begin == pIn) {
            src_begin++;
        } else {
            src_begin = pIn;
        }
    }

    return 0;
}


/** C-style stream converter, which copies source bytes to output
 *  bytes.
 */
int encoded_base::copy(const char * & src_begin,
                          const char * & src_end,
                          char * & dest_begin,
                          char * & dest_end)
{
    size_t srclen = src_end - src_begin;
    size_t dstlen = dest_end - dest_begin;
    size_t length = std::min(srclen, dstlen);

    memmove((void*) dest_begin, (void *) src_begin, length);
    src_begin += length;
    dest_begin += length;

    return 0;
}


/** @brief Processes the input stream through the stream filter.
 */
int encoded_base::process(const char * & src_begin,
                          const char * & src_end,
                          char * & dest_begin,
                          char * & dest_end,
                          int /* flushLevel */)
{
    if (differentCharset) {
        return convert(src_begin, src_end, dest_begin, dest_end);
    } else {
        return copy(src_begin, src_end, dest_begin, dest_end);
    }
}


}   /* detail */
}   /* iostreams */
}   /* boost */

Sample Program

#include "encoding.hpp"

#include <boost/iostreams/filtering_streambuf.hpp>
#include <fstream>
#include <string>


int main()
{
    std::ifstream fin("utf8.csv", std::ios::binary);
    std::ofstream fout("utf16le.csv", std::ios::binary);

    // encoding
    boost::iostreams::filtering_streambuf<boost::iostreams::input> streambuf;
    streambuf.push(boost::iostreams::encoded_reader({"UTF-8", "UTF-16LE"}));
    streambuf.push(fin);
    std::istream stream(&streambuf);

    std::string line;
    while (std::getline(stream, line)) {
        fout << line << std::endl;
    }
    fout.close();
}

In the above example, we write a copy of a UTF-8-encoded file as UTF-16LE, using a stream buffer to convert the UTF-8 text to UTF-16LE, which we write as bytes to our output, adding only 4 lines of (readable) code for the entire process.

Alex Huszagh