
I have a file with the following contents:

$ xxd 1line
0000000: 3939 ba2f 6f20 6f66 0d0a                 99./o of..

I would like to read this one line in C++:

#include <codecvt>
#include <iostream>
#include <locale>
#include <fstream>
#include <string>

int main(int argc, char** argv) {
  std::wifstream wss(argv[1], std::ios::binary);
  wss.seekg(std::ios_base::end);
  const auto fileSize = wss.tellg();
  wss.seekg(std::ios_base::beg);

  // std::locale utf8_locale(wss.getloc(), new std::codecvt_utf8<wchar_t, 0x10FFFF, std::consume_header>);
  // wss.imbue(utf8_locale);

  std::wstring wline;
  std::getline(wss, wline);

  std::cout << "filelen: " << fileSize << std::endl;
  std::cout << "strlen: " << wline.size() << std::endl;
  std::wcout << "str: " << wline << std::endl;

  return 0;
}

I compile it as follows:

$ g++ -std=c++11 imbue_issue.cpp

First, it seems that wss.seekg(std::ios_base::end) does not move the file position to the end of the file:

$ ./a.out 1line
filelen: 2
strlen: 9
str: 99?/o of

Second, when I uncomment the locale-related lines, getline reads only 2 characters:

$ ./a.out 1line
filelen: 2
strlen: 2
str: 99

My compiler:

$ g++ --version
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/usr/include/c++/4.2.1
Apple LLVM version 7.3.0 (clang-703.0.31)
Target: x86_64-apple-darwin15.6.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

Does anyone have an idea why the above issues occur with this file?

mkk
  • What is the size of `wchar_t` on your system? I'll bet that it's `4`. That means the size of the file is 2.5 `wchar_t` characters large, which gets truncated to `2`. – Some programmer dude Oct 18 '16 at 08:20
  • Yes, it is 4 bytes. – mkk Oct 18 '16 at 08:22
  • Also, judging by the data-dump of the file contents, it doesn't seem to be a file of wide characters, but of narrow characters. Try using "ordinary" narrow `char` instead, and see what results you get. – Some programmer dude Oct 18 '16 at 08:22
  • Funny thing, when I changed from wifstream to ifstream, the fileSize still equals 2. Is ifstream based on char? I checked that sizeof(char) on my system is 1. – mkk Oct 18 '16 at 08:28

2 Answers


The problem is how you call the seekg function. When you call it with one argument, that argument is treated as an absolute position from the beginning of the file, so you seek to whatever value std::ios::end has, which happens to be 2 in your case.

Instead you should use the two-argument overload:

wss.seekg(0, std::ios_base::end);  // Seek to offset 0 from the end
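
As a minimal sketch of the whole pattern, the size query can be done on a narrow binary stream so that the result is a byte count. The helper name `file_size_bytes` is hypothetical and not part of the original code:

```cpp
#include <fstream>
#include <ios>
#include <string>

// Hypothetical helper (name chosen for illustration only).
// Returns the size of the file in bytes, or -1 if it cannot be opened.
std::streamoff file_size_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        return -1;
    }
    // Two-argument overload: offset 0 relative to the end of the file.
    in.seekg(0, std::ios_base::end);
    return static_cast<std::streamoff>(in.tellg());
}
```

For the 10-byte file shown in the question, this should report 10 rather than 2.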

You will still have problems reading the file with wide-character types, since its contents don't seem to be wide characters. UTF-8 is a multi-byte narrow character encoding.
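
One way to follow that advice is to read the line as raw bytes with a narrow stream and decide how to decode it afterwards. The sketch below assumes the file uses CRLF or LF line endings; `read_first_line_bytes` is a hypothetical name. Note that the third byte in the question's file, 0xBA, is a UTF-8 continuation byte with no lead byte before it, so the data is not valid UTF-8. That is probably why the codecvt_utf8 conversion stops after "99".

```cpp
#include <fstream>
#include <string>

// Hypothetical helper (name chosen for illustration only).
// Reads the first line of `path` as raw bytes, without any decoding.
std::string read_first_line_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::string line;
    std::getline(in, line);  // stops at '\n', which is discarded
    if (!line.empty() && line.back() == '\r') {
        line.pop_back();     // drop the CR of a CRLF line ending
    }
    return line;
}
```

With the file from the question, the returned string should hold 8 bytes, the third of which is 0xBA.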

Some programmer dude
  • I understand that the imbue problem is related to using wifstream instead of ifstream, right? Is there any generic way of reading files in C++ if we don't know what the contents are? If not, how do we programmatically decide whether ifstream or wifstream is needed? – mkk Oct 18 '16 at 08:53
  • @mkk No, there's really no good way of detecting how the data was saved or its encoding, unless the file has some kind of header or similar, like the UTF-8 [byte-order mark](https://en.wikipedia.org/wiki/Byte_order_mark). – Some programmer dude Oct 18 '16 at 08:57
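
To illustrate the byte-order-mark check mentioned in the last comment, here is a minimal sketch that tests whether a file begins with the three UTF-8 BOM bytes EF BB BF. The name `has_utf8_bom` is hypothetical. A missing BOM does not prove the file is not UTF-8, since many UTF-8 files are written without one.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper (name chosen for illustration only).
// Returns true if the file starts with the UTF-8 byte-order mark EF BB BF.
bool has_utf8_bom(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    char head[3] = {0, 0, 0};
    in.read(head, 3);
    if (in.gcount() != 3) {
        return false;  // unreadable file, or shorter than a BOM
    }
    return static_cast<unsigned char>(head[0]) == 0xEF &&
           static_cast<unsigned char>(head[1]) == 0xBB &&
           static_cast<unsigned char>(head[2]) == 0xBF;
}
```

The file from the question starts with the bytes 39 39 BA, so this check returns false for it.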

I found that someone had similar issues with getline:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=15733

mkk