2

Using MinGW 7.3.0 on Windows, Hunspell can't load dictionary files from locations that contain non-ASCII characters because of Windows limitations. I've tried everything[1] and I'm now resorting to copying the files to a path without non-ASCII characters before giving it to Hunspell. What is a good location to copy them to?

[1]

  1. Windows needs a wchar_t overload of std::ifstream::open() for this to work right, which MinGW does not implement
  2. std::filesystem can solve this, but it is only available from GCC 8 onward
  3. Hunspell insists on loading the files on its own; it is not possible to pass the file contents to it as strings
Vadim Peretokin
  • 1
    Check if you can force a `CreateFile` handle into `std::ifstream` like [https://stackoverflow.com/a/476014/8666197](https://stackoverflow.com/a/476014/8666197). If gcc implements the needed functionality, then you need to modify the `myopen` function and probably `FileMgr::~FileMgr`. Alternatively, reimplement the `FileMgr` class. It looks like you need to implement one essential function, `getline`. – Daniel Sęk Jul 20 '19 at 09:17

3 Answers

3

The "natural" fit would be to use the user's chosen temporary directory (or a subdirectory thereof) (see %temp% or GetTempPath()). However, that defaults to something that contains the user name (which can contain "non-ASCII" characters; e.g. c:\users\Ø¥Ć¼\AppData\LocalLow\Temp) or to something arbitrary (regarding character set) altogether.

So you're most likely best off with one of the following:

a) Choose a directory that does not contain off-limits characters from the get-go. For example, a directory underneath C:\ProgramData that you name yourself (e.g. after the application), so that you know it does not contain non-ASCII characters.

b) Let the user decide where to put these files, and make sure only paths that consist entirely of allowed characters can be entered.

c) Pass the "short path name" to Hunspell, which should not contain non-ASCII characters for compatibility with FAT file system traits. For example, the short path name for c:\temp\Ø¥Ć¼ is c:\temp\571D~1.

You can see the short names for directories using cmd.exe /c dir /x:

C:\temp>dir /x
...    
19.07.2019  15:30    <DIR>                       .
19.07.2019  15:30    <DIR>                       ..
19.07.2019  15:30    <DIR>          571D~1       Ø¥Ć¼

How you can invoke the GetShortPathName Win32 API from MinGW I don't know, but I would assume that it is possible.

Also make sure to review the MSDN page for the above function for trade-offs, e.g. short names are not supported everywhere (e.g. SMB; see also the comments below).
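Options a) and b) both come down to checking that a candidate directory is ASCII-only before a dictionary is copied there. A minimal sketch of such a check follows; `is_ascii_only` and `choose_copy_dir` are hypothetical helper names, not part of Hunspell or the Win32 API, and the paths in the usage note are only examples.

```cpp
#include <cassert>
#include <string>

// Returns true when every code unit of the path is 7-bit ASCII.
bool is_ascii_only(const std::wstring& path) {
    for (wchar_t c : path) {
        if (static_cast<unsigned long>(c) > 0x7F) {
            return false;
        }
    }
    return true;
}

// Hypothetical helper: keep the preferred directory when it is ASCII-only,
// otherwise fall back to a directory the application picked itself.
std::wstring choose_copy_dir(const std::wstring& preferred,
                             const std::wstring& fallback) {
    // The fallback is assumed to be a known-good, ASCII-only path.
    assert(is_ascii_only(fallback));
    return is_ascii_only(preferred) ? preferred : fallback;
}
```

For instance, `choose_copy_dir(temp_dir, L"C:\\ProgramData\\MyApp")` would keep `temp_dir` unless it contains a non-ASCII user name. Whether the fallback is actually writable still has to be checked separately.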

Christian.K
  • Thanks - I can't write to a) and b) is a no-go. Wasn't sure if MinGW provides that API for C, so instead I decided to copy to `C:\Windows\Temp` - and it works! – Vadim Peretokin Jul 19 '19 at 14:35
  • 1
    Application files do not belong inside the `Windows` folder, only system files do. Find a better suited folder. – Remy Lebeau Jul 19 '19 at 22:59
  • 1
    I am asking this very question to find out which is a good folder, so, advice is welcome! – Vadim Peretokin Jul 20 '19 at 04:18
  • 1
    @VadimPeretokin Note that `GetShortPathNameW` isn't reliable, because [8.3 file name generation may be disabled](https://support.microsoft.com/en-us/help/121007/how-to-disable-8-3-file-name-creation-on-ntfs-partitions). Future OS versions may have it disabled by default for better performance. – zett42 Jul 20 '19 at 12:28
  • 1
    @zett42, `GetShortPathNameW` is definitely unreliable -- today. I always disable short-name creation on NTFS volumes because it slows down access, especially in directories with thousands of files. Also, newer filesystems such as ReFS and exFAT do not support short names at all, and Hunspell isn't necessarily installed on the NTFS system drive. – Eryk Sun Jul 21 '19 at 02:28
3

From this bug tracker:

In WIN32 environment, use UTF-8 encoded paths started with the long path prefix \\?\ to handle system-independent character encoding and very long path names (without the long path prefix Hunspell will use fopen() with system-dependent character encoding instead of _wfopen()).

So the actual solution seems to be:

  1. Call GetFullPathNameW to normalize the path. Required because paths with long path prefix \\?\ are passed to the NT API unchanged.
  2. Prepend L"\\\\?\\" to the normalized path (backslashes doubled because of C string literal requirements).
  3. For a UNC path, you have to use the "UNC" device directly, i. e. L"\\\\server\\share" -> L"\\\\?\\UNC\\server\\share" (thanks eryksun).
  4. Encode the path in UTF-8, e. g. using WideCharToMultiByte() with CP_UTF8.
  5. Pass the final UTF-8 encoded path to Hunspell.
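The steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not Hunspell's own code: `add_long_path_prefix` and `hunspell_path_utf8` are hypothetical names, only `GetFullPathNameW` and `WideCharToMultiByte` are real Win32 calls, and the Windows-only part is guarded so the prefix logic can be compiled and tested on its own.

```cpp
#include <cassert>
#include <string>

// Steps 2 and 3: add the long path prefix to an already normalized,
// absolute path. UNC paths are routed through the "UNC" device.
std::wstring add_long_path_prefix(const std::wstring& full_path) {
    assert(!full_path.empty());
    const std::wstring prefix = L"\\\\?\\";
    const std::wstring device = L"\\\\.\\";
    // Already an extended or device path: leave it unchanged.
    if (full_path.compare(0, prefix.size(), prefix) == 0 ||
        full_path.compare(0, device.size(), device) == 0) {
        return full_path;
    }
    // \\server\share\... becomes \\?\UNC\server\share\...
    if (full_path.compare(0, 2, L"\\\\") == 0) {
        return prefix + L"UNC\\" + full_path.substr(2);
    }
    return prefix + full_path;
}

#ifdef _WIN32
#include <windows.h>

// Steps 1 to 4 combined; returns an empty string on failure.
std::string hunspell_path_utf8(const std::wstring& path) {
    // Step 1: normalize with GetFullPathNameW (first call asks for the size).
    DWORD needed = GetFullPathNameW(path.c_str(), 0, nullptr, nullptr);
    if (needed == 0) return {};
    std::wstring full(needed, L'\0');
    DWORD written = GetFullPathNameW(path.c_str(), needed, &full[0], nullptr);
    if (written == 0 || written >= needed) return {};
    full.resize(written);

    const std::wstring prefixed = add_long_path_prefix(full);

    // Step 4: encode as UTF-8.
    const int len = static_cast<int>(prefixed.size());
    int bytes = WideCharToMultiByte(CP_UTF8, 0, prefixed.c_str(), len,
                                    nullptr, 0, nullptr, nullptr);
    if (bytes <= 0) return {};
    std::string utf8(static_cast<size_t>(bytes), '\0');
    WideCharToMultiByte(CP_UTF8, 0, prefixed.c_str(), len,
                        &utf8[0], bytes, nullptr, nullptr);
    return utf8;
}
#endif
```

The string returned by `hunspell_path_utf8` is what would be handed to Hunspell in step 5. Whether this works with a given Hunspell build depends on that build honoring the prefix as described in the quote above.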
zett42
  • 1
    I would expect only the UTF-8 encoding is really required, as opposed to rewriting the path as an extended device path. That's assuming it's not a matter of a reserved name in the final component (i.e. a DOS device name or a name that ends with spaces or dots) and not a matter of the `MAX_PATH` (260) length limit. If you do use an extended device path, then for a UNC path, you have to use the "UNC" device directly (i.e. `"\\\\server\\share"` -> `"\\\\?\\UNC\\server\\share"`). – Eryk Sun Jul 21 '19 at 02:36
  • That should be the accepted answer. Would delete mine, but it is still accepted so I can't. – Christian.K Jul 21 '19 at 07:02
  • @Christian.K Looks like OP has [tried this before](https://stackoverflow.com/q/56988605/7571258) with no success. – zett42 Jul 21 '19 at 10:53
  • Hmm... OK. So I leave things as they are ;-) – Christian.K Jul 22 '19 at 06:23
-4

It looks like C:\Windows\Temp is still a valid path you can write to yourself.

Vadim Peretokin
  • 2
    There is no guarantee that such a folder is the actual `%TEMP%` path being used, though (in fact, it likely isn't). And non-admin users don't have write access inside of the `Windows` folder or its sub-folders. – Remy Lebeau Jul 19 '19 at 22:57