
I ran into a problem with the Boost file streams: I need to create and modify files in the user directory on Windows. However, the user name contains an umlaut, which makes this fail when compiled with MinGW, because its standard library lacks the wide-character open() overload for file streams that Boost relies on. See "Read/Write file with unicode file name with plain C++/Boost", "UTF-8-compliant IOstreams" and https://svn.boost.org/trac10/ticket/9968

However, I read that this problem mainly occurs when a file name contains a character outside the system's code page. In my case I only use characters from the system's code page, since the user's directory obviously exists. This makes me think it should work if I could tell boost::filesystem::path to treat every std::string as UTF-8 but convert it to the system encoding when the string() member function is called (which happens in boost::filesystem::fstream::open).

So basically: is there any way to do that conversion (UTF-8 -> system encoding) automatically using Boost (and Boost.Locale)?

For completeness, here is my code for setting the locale:

#ifdef _WIN32
        // On windows we want to enforce the encoding (mostly UTF8). Also using "" would use the default which uses "wrong" separators
        std::locale::global(boost::locale::generator().generate("C"));
#else
        // In linux / OSX this suffices
        std::locale::global(std::locale::classic());
#endif // _WIN32
        // Use also the encoding (mostly UTF8) for bfs paths
        bfs::path::imbue(std::locale());
Flamefire

2 Answers


It is a problem on Windows, because Windows uses UTF-16, not UTF-8. I use this function regularly to solve your very problem:

// get_filename_token.cpp

// Turns a UTF-8 filename into something you can pass to fstream::open() on 
// Windows. Returns the argument on other systems.

// Copyright 2013 Michael Thomas Greer
// Distributed under the Boost Software License, Version 1.0.
// (See accompanying file LICENSE_1_0.txt 
//  or copy at            http://www.boost.org/LICENSE_1_0.txt )

// <string> is needed on every platform, so include it outside the #ifdef
#include <string>

#ifdef _WIN32

#ifndef NOMINMAX
#define NOMINMAX
#endif
#include <windows.h>

std::string get_filename_token( const std::string& filename )
  {
  // Convert the UTF-8 argument path to a Windows-friendly UTF-16 path
  int n = MultiByteToWideChar( CP_UTF8, 0, filename.c_str(), -1, NULL, 0 );
  if (n <= 0) return filename;  // conversion failed: hand back the argument
  std::wstring widepath( n, L'\0' );
  MultiByteToWideChar( CP_UTF8, 0, filename.c_str(), -1, &widepath[0], n );

  // Now get the 8.3 version of the name (fails if the file does not exist)
  DWORD len = GetShortPathNameW( widepath.c_str(), NULL, 0 );
  if (len == 0) return filename;
  std::wstring shortpath( len, L'\0' );
  if (GetShortPathNameW( widepath.c_str(), &shortpath[0], len ) == 0) return filename;

  // Convert the short version back to a C++-friendly char version
  n = WideCharToMultiByte( CP_UTF8, 0, shortpath.c_str(), -1, NULL, 0, NULL, NULL );
  if (n <= 0) return filename;
  std::string result( n, '\0' );
  WideCharToMultiByte( CP_UTF8, 0, shortpath.c_str(), -1, &result[0], n, NULL, NULL );

  // n counted the terminating NUL; drop it from the std::string
  result.resize( n - 1 );
  return result;
  }

#else

std::string get_filename_token( const std::string& filename )
  {
  // For all other systems, just return the argument UTF-8 string.
  return filename;
  }

#endif

(I have stripped out a few things to post here.)
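
A rough usage sketch, in case it helps: the token is meant to be passed straight to a narrow-character stream. The helper name `read_file_utf8` and the `filename_converter` typedef are made up for this illustration, and the converter is taken as a parameter so the snippet compiles on its own. It only makes sense for files that already exist, since the short name cannot be looked up otherwise.

```cpp
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Signature of a converter such as get_filename_token() (hypothetical typedef).
typedef std::string (*filename_converter)(const std::string&);

// Reads a whole existing file whose name is given as UTF-8.
// The converter turns the UTF-8 name into something a narrow
// std::ifstream can open on the current platform.
std::string read_file_utf8(const std::string& utf8_path,
                           filename_converter to_token)
{
    std::ifstream in(to_token(utf8_path).c_str(), std::ios::binary);
    if (!in)
        throw std::runtime_error("cannot open " + utf8_path);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

With the function above this would be called as `read_file_utf8(path, get_filename_token)`.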

Dúthomhas
  • Yeah, I guessed I'd need to go down that way. I was even thinking about creating a new iofstream class that works like the Boost class but provides new open functions and constructors. Why are you converting back to UTF-8, though? Wouldn't `CP_ACP` be better suited? And why doesn't Boost go that way, as it seems quite simple? Are there any drawbacks, such as the 8.3 name not always being ANSI? – Flamefire Sep 26 '17 at 02:36
  • Found the drawback: this only works with existing files. So one would need to make sure the file actually exists, which might open the door to race conditions when implementing a "create with wide char, then open with short path" approach. – Flamefire Sep 26 '17 at 03:38
  • Because it is cross-platform. All modern *nixen will open a UTF-8 filename and it doesn't break old code. Likewise _all_ Windows file names can be converted to an OS-acceptable 8.3 filename “token” which is technically a UTF-8 subset. – Dúthomhas Sep 26 '17 at 03:38
  • Yes, if you want to create a file containing non-ANSI characters, on Windows you MUST use UTF-16. That’s another function. – Dúthomhas Sep 26 '17 at 03:40
  • It seems 8.3 paths can be disabled at the OS level, which also makes this unsuitable. However, I found Pathie and Boost.NoWide, which could work (http://cppcms.com/files/nowide/html/, https://quintus.github.io/pathie-cpp/index.html) – Flamefire Sep 27 '17 at 21:10
  • LOL, I guess I never figured someone would _want_ to disable them... Make your link to nowide an answer and I'll upvote it. – Dúthomhas Sep 27 '17 at 21:27
  • Me neither, but the possibility means someone WILL break your code eventually. You know, the usual fight programmers vs users. Upvoted anyway as the solution might work for someone. – Flamefire Sep 27 '17 at 21:57

I found two solutions using other libraries; both have their drawbacks.

  1. Pathie (Docu) It looks like a full replacement for boost::filesystem, providing UTF-8-aware streams and path handling as well as symlink creation and other file/folder operations. Really cool is the built-in support for getting special directories (temp, HOME, programs folder and many more).
    Drawback: It only works as a dynamic library, as the static build has bugs. It might also be overkill if you already use Boost.
  2. Boost.NoWide (Docu) Provides alternatives to almost all file and stream handlers to support UTF-8 on Windows, and falls back to the standard functions on other systems. The file streams accept UTF-8 encoded values (for the name), and it uses Boost itself.
    Drawback: No path handling, and it does not accept bfs::path or wide strings (the internal format of bfs::path on Windows is UTF-16), so a patch would be required, although it is a simple one. It also needs to be built as a library on Windows if you want to use std::cout etc. with UTF-8 strings (yes, that works directly!).
    Another cool thing: It provides a class to convert argc/argv to UTF-8 on Windows.
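
To illustrate the general idea behind such UTF-8-aware wrappers (this is my own minimal sketch, not code taken from either library): convert the UTF-8 name to UTF-16 and call the wide-character CRT function on Windows, and pass the bytes through unchanged elsewhere. The names `utf8_to_utf16` and `fopen_utf8` are invented for this example; `_wfopen` is the Microsoft CRT function, which I assume MinGW exposes as well.

```cpp
#include <cstdio>
#include <stdexcept>
#include <string>
#ifdef _WIN32
#include <wchar.h>   // _wfopen
#endif

// Decodes UTF-8 into UTF-16 code units. Minimal: rejects truncated or
// malformed sequences, but does not reject overlong forms or surrogates.
std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes expected
        if (c < 0x80)                { cp = c;        extra = 0; }
        else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; }
        else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; }
        else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (extra > in.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cc = static_cast<unsigned char>(in[i + k]);
            if ((cc & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (cc & 0x3F);
        }
        i += extra + 1;
        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of range");
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {  // encode as a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// Opens a file whose name is UTF-8 on every platform.
std::FILE* fopen_utf8(const std::string& utf8_path, const char* mode)
{
#ifdef _WIN32
    // wchar_t is 16 bits on Windows, so the code units can be copied over.
    const std::u16string p = utf8_to_utf16(utf8_path);
    const std::u16string m = utf8_to_utf16(mode);
    return _wfopen(std::wstring(p.begin(), p.end()).c_str(),
                   std::wstring(m.begin(), m.end()).c_str());
#else
    return std::fopen(utf8_path.c_str(), mode);
#endif
}
```

Unlike the short-path approach, this also works for files that do not exist yet, because nothing has to be looked up before the open call.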
Flamefire