4

I'm trying to read a file that is encoded as UTF-16LE with a BOM. I tried this code:

#include <iostream>
#include <fstream>
#include <locale>
#include <codecvt>

int main() {

  std::wifstream fin("/home/asutp/test");
  fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));
  if (!fin) {
    std::cout << "!fin" << std::endl;
    return 1;
  }
  if (fin.eof()) {
    std::cout << "fin.eof()" << std::endl;
    return 1;
  }
  std::wstring wstr;
  getline(fin, wstr);
  std::wcout << wstr << std::endl;

  if (wstr.find(L"Test") != std::string::npos) {
    std::cout << "Found" << std::endl;
  } else {
    std::cout << "Not found" << std::endl;
  }

  return 0;
}

The file can contain Latin and Cyrillic characters. I created the file with the string "Test тест", and this code prints:

/home/asutp/CLionProjects/untitled/cmake-build-debug/untitled

Not found

Process finished with exit code 0

I'm on Linux Mint 18.3 x64, CLion 2018.1.

Tried with:

  • gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.9)
  • clang version 3.8.0-2ubuntu4 (tags/RELEASE_380/final)
  • clang version 5.0.0-3~16.04.1 (tags/RELEASE_500/final)
Kot Shrodingera
  • My test file http://rgho.st/7xH6WMcGZ – Kot Shrodingera Jun 05 '18 at 09:38
  • Print out what's in `wstr`? – Paul Sanders Jun 05 '18 at 09:40
  • @PaulSanders There's `std::wcout << wstr << std::endl;` in the code. A blank line is printed (before "Not found"). – Kot Shrodingera Jun 05 '18 at 09:46
  • Sorry, missed that. I'm not an expert on `std::codecvt`, but you could consider switching to `std::basic_string` and code `u"Test"` instead of `L"Test"` and thus avoid the need for it altogether. – Paul Sanders Jun 05 '18 at 10:17
  • This code works fine for me. Are you sure that your file has BOM? – user7860670 Jun 05 '18 at 10:22
  • Your code works fine in Windows (where `wchar_t` is 2 bytes). The standard is not exactly clear: it suggests parts of the code are deprecated, but it doesn't say what they are replaced with. Show the printout for this: `for(auto c : wstr) cout << int(c) << " ";` – Barmak Shemirani Jun 05 '18 at 20:03
  • @PaulSanders If I switch to `basic_string`, getline doesn't work with wifstream. If I change wifstream to `basic_ifstream`, getline throws `terminate called after throwing an instance of 'std::bad_cast' what(): std::bad_cast`. @VTT I posted the file in the first comment. The first bytes are FF FE. @BarmakShemirani Your code printed nothing; wstr is empty. `wchar_t` is 4 bytes on my platform. I tried adding the compiler option `-fshort-wchar` so that `sizeof(wchar_t)` returns 2, but my code still doesn't work. – Kot Shrodingera Jun 06 '18 at 02:00
  • Take a look at the answer given by @Barmak Shemirani below. You should consider saving the file in UTF8 before reading it with "getline" (or whatever). While your code works on Windows, it will have problems under Linux. – SChepurin Jun 06 '18 at 15:05
  • It is possible to read any file as a sequence of bytes. What do you want to do with the content? – n. m. could be an AI Jun 06 '18 at 15:54
  • Anyway, probably the simplest way to read a file in any known encoding is to use libiconv, optionally with one of the many available C++ stream-based wrappers. – n. m. could be an AI Jun 06 '18 at 16:02
  • Not the cause of your problem, but you don’t want to pass a `new std::codecvt_utf16` as the parameter of `imbue()`. It will leak memory. Remove the keyword `new` to pass in a temporary object that will be cleaned up automatically. – Davislor Jan 06 '19 at 04:41
  • You’re reading in only a single line, and apparently finishing prematurely. Have you tried reading with `fin.get()` instead? – Davislor Jan 06 '19 at 04:44
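
On the `new` passed to `imbue()` that is discussed in the comments above: a facet constructed with the default `refs == 0` is owned by the `std::locale` it is installed into and is destroyed together with the last locale that refers to it. The following is only a minimal sketch of that lifetime rule. `CountingFacet`, `Counts` and `facet_lifetime_demo` are hypothetical names that do not appear in the question, and a plain `std::codecvt` base stands in for `std::codecvt_utf16`.

```cpp
#include <cwchar>
#include <locale>

// Number of CountingFacet objects currently alive.
static int live_facets = 0;

// Hypothetical facet used only to observe construction and destruction.
// The base constructor is called with its default argument refs == 0,
// which means the owning locale is responsible for deleting the facet.
struct CountingFacet : std::codecvt<wchar_t, char, std::mbstate_t> {
    CountingFacet() { ++live_facets; }
    ~CountingFacet() override { --live_facets; }
};

struct Counts {
    int inside;  // live facets while the locale exists
    int after;   // live facets after the locale is destroyed
};

Counts facet_lifetime_demo() {
    Counts c{};
    {
        // Same pattern as fin.imbue(std::locale(fin.getloc(), new ...)).
        std::locale loc(std::locale::classic(), new CountingFacet);
        c.inside = live_facets;
    }
    // loc is gone here; it was the last locale referring to the facet.
    c.after = live_facets;
    return c;
}
```

Calling `facet_lifetime_demo()` reports one live facet while the locale exists and none afterwards, so there is no manual `delete` to write in this pattern.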

2 Answers

8

Ideally you should save files in UTF8, because Windows has much better UTF8 support (aside from displaying Unicode in the console window), while POSIX has limited UTF16 support. Even Microsoft products favor UTF8 for saving files in Windows.

As an alternative, you can read the UTF16 file into a buffer and convert that to UTF8 (std::codecvt_utf8_utf16):

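#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

int main() {
    std::ifstream fin("utf16.txt", std::ios::binary);
    if (!fin) return 1;

    //file size in bytes
    fin.seekg(0, std::ios::end);
    std::size_t size = static_cast<std::size_t>(fin.tellg());
    if (size < 2) return 1;

    //skip BOM
    fin.seekg(2, std::ios::beg);
    size -= 2;

    //read the code units as-is (assumes a little-endian host)
    std::u16string u16(size / 2, u'\0');
    fin.read(reinterpret_cast<char*>(&u16[0]), u16.size() * sizeof(char16_t));

    std::string utf8 = std::wstring_convert<
        std::codecvt_utf8_utf16<char16_t>, char16_t>{}.to_bytes(u16);
}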

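Or, combining the byte pairs manually:

#include <codecvt>
#include <fstream>
#include <locale>
#include <sstream>
#include <string>

int main() {
    std::ifstream fin("utf16.txt", std::ios::binary);

    //skip BOM
    fin.seekg(2);

    //read as raw bytes
    std::stringstream ss;
    ss << fin.rdbuf();
    std::string bytes = ss.str();

    //make sure len is divisible by 2
    std::size_t len = bytes.size();
    if (len % 2) len--;

    //combine each byte pair into one UTF16 code unit
    std::wstring sw;
    for (std::size_t i = 0; i < len; i += 2)
    {
        //little-endian: low byte first
        int lo = bytes[i] & 0xFF;
        int hi = bytes[i + 1] & 0xFF;
        sw.push_back(static_cast<wchar_t>(hi << 8 | lo));
    }

    //codecvt_utf8_utf16 treats each wchar_t as a UTF16 code unit
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
    std::string utf8 = convert.to_bytes(sw);
}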
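
Both snippets above skip the BOM unconditionally and rely on `std::wstring_convert` and `std::codecvt_utf8_utf16`, which are deprecated since C++17. As a rough sketch of the same idea without those facilities, the conversion can be written by hand. The names `append_utf8` and `utf16_bytes_to_utf8` are hypothetical helpers, not part of any library, and the choices to treat input without a BOM as little-endian and to replace unpaired surrogates with U+FFFD are assumptions made for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Append one Unicode code point to a UTF-8 encoded string.
void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: decode raw UTF-16 bytes into UTF-8.
// A leading BOM (FF FE or FE FF) selects the byte order and is dropped;
// without a BOM, little-endian is assumed. A trailing odd byte is ignored.
std::string utf16_bytes_to_utf8(const std::string& bytes) {
    std::size_t i = 0;
    bool little = true;
    if (bytes.size() >= 2) {
        unsigned char b0 = static_cast<unsigned char>(bytes[0]);
        unsigned char b1 = static_cast<unsigned char>(bytes[1]);
        if (b0 == 0xFF && b1 == 0xFE) { little = true; i = 2; }
        else if (b0 == 0xFE && b1 == 0xFF) { little = false; i = 2; }
    }
    // Read the 16-bit code unit that starts at byte offset pos.
    auto unit = [&](std::size_t pos) -> std::uint32_t {
        std::uint32_t a = static_cast<unsigned char>(bytes[pos]);
        std::uint32_t b = static_cast<unsigned char>(bytes[pos + 1]);
        return little ? (b << 8 | a) : (a << 8 | b);
    };
    std::string out;
    while (i + 1 < bytes.size()) {
        std::uint32_t cp = unit(i);
        i += 2;
        // A high surrogate followed by a low surrogate forms one code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < bytes.size()) {
            std::uint32_t lo = unit(i);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                i += 2;
            }
        }
        // Anything still in the surrogate range is unpaired: substitute U+FFFD.
        if (cp >= 0xD800 && cp <= 0xDFFF) cp = 0xFFFD;
        append_utf8(out, cp);
    }
    return out;
}
```

To use it, read the whole file in binary mode into a `std::string` (for example through a `std::stringstream` as in the second snippet, but without the `seekg(2)`) and pass that string to `utf16_bytes_to_utf8`. Checking the BOM instead of skipping two bytes blindly keeps the sketch usable for big-endian files and for files that have no BOM.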
Barmak Shemirani
  • I'm getting new files in UTF-16 encoding, and I can't change them. I've already ended up with a similar solution using `wstring_convert`. But anyway, thank you very much for your solution; it's good to know different approaches. Still, I wonder why the `imbue` method didn't work for me, because other people said it worked, even on Linux. – Kot Shrodingera Jun 07 '18 at 04:21
  • This answer didn't consider surrogate pairs. – Searene Jan 05 '19 at 09:54
  • @BarmakShemirani Sorry, I thought that you were trying to convert part of a surrogate pair to utf8, but I was wrong, I just tried your code and found that it worked. Sorry for the misleading comment and the downvote, but I cannot undo the downvote unless the post is edited. I will be more cautious next time. – Searene Jan 06 '19 at 04:19
  • @Searene I appreciate the comment. It's better than people who vote down without explaining why. I edited the answer, by the way. – Barmak Shemirani Jan 06 '19 at 04:29
  • @BarmakShemirani "Even Microsoft products favor UTF8 for saving files in Windows." - This is simply not true nor has been for many years. In all their binary formats they've been using UTF-16 text encoding for a VERY long time (since the XP times IIRC). – CoolKoon Sep 01 '23 at 20:34
0

Use std::wstring::npos (not std::string::npos) in the comparison, and your code should work:

...
 //std::wcout << wstr << std::endl;

  if (wstr.find(L"Test") == std::wstring::npos) {
    std::cout << "Not Found" << std::endl;
  } else {
    std::cout << "found" << std::endl;
  } 
SChepurin