This post is not a duplicate of this one: dirent not working with unicode

Here I'm on a different OS, and I'm also trying to do something different. The other thread simply counts the files, whereas I need to read the file names, which is more complex.


I'm trying to extract data from file names on Windows 10.

For this purpose I use dirent.h (an external C library, but still very useful in C++).

DIR* directory = opendir(path);
struct dirent* direntStruct;

if (directory != NULL)
{
    // readdir returns NULL once every entry has been read
    while ((direntStruct = readdir(directory)) != NULL)
    {
        cout << direntStruct->d_name << endl;
    }
    closedir(directory);
}

This code retrieves the names of all files located in a specific folder, one by one, and it works pretty well.

But when it encounters a file name containing the character 'œ', things go wrong:

Example:

grosse blessure au cœur.txt

is read in my program as:

GUODU0~6.TXT

I can't recover the original data from the name because, as you can see, my string variable has nothing to do with the current file name!

I can rename the file and then it works, but I don't want to do that. I just need to read the data from the file name, and it seems impossible. How can I do this?

Impact
  • _"my string variable has nothing to do with the current file name!"_ Are you sure? ;) – Asteroids With Wings Nov 04 '20 at 15:27
  • @cigien: Bad dupe. Wrong problem description, wrong platform, no applicable solution... – Asteroids With Wings Nov 04 '20 at 15:28
  • @AsteroidsWithWings Hmm, the problem description seems the same. Also, why do you think the platform is wrong? – cigien Nov 04 '20 at 15:30
  • @cigien Because those answers are for Mac and Linux, and this question is about Windows? The problem description is completely different: that other question is about `readdir` skipping files; this one is about receiving DOS "short paths" instead of full paths. Please read the questions fully before dupe-closing: note that it may take more than 3 minutes to do so. :) – Asteroids With Wings Nov 04 '20 at 15:31
  • @Asteroids where do you see OP talking about Windows as their OS here? – scohe001 Nov 04 '20 at 15:33
  • @scohe001 `GUODU0~6.TXT` is a short path that you find on Windows. Also there's a hint (though this is not proof) with the `.txt` filename extension. When you know, you know. – Asteroids With Wings Nov 04 '20 at 15:33
  • @AsteroidsWithWings I still think it's a dupe, and a windows solution should be added there. Also, can you edit the question to clarify why this is a Windows question? It's unclear to me. – cigien Nov 04 '20 at 15:33
  • I see you edited the title to say "on windows". Could you also link to that other answer, and explain why that doesn't solve your problem? That will prevent others from accidentally closing the question. – cigien Nov 04 '20 at 15:42
  • Thank you :) The edit makes it a lot clearer now. One more thing, the code seems to be in C, but you've tagged it with C++. Can you change that tag as well? – cigien Nov 04 '20 at 15:46
  • @cigien The printf is not really my code. You are right that dirent.h is a C library, but I use it in C++. So maybe I should rename it to C/C++? – Impact Nov 04 '20 at 15:56
  • Oh, if you are compiling your code with a C++ compiler, then definitely add that tag. But also keep the C tag, since as you mentioned `dirent.h` is a C library. – cigien Nov 04 '20 at 15:57
  • That's not exactly right. Instead of changing the title, add the appropriate tag. And then remove the "in C/C++" from the title entirely :) – cigien Nov 04 '20 at 15:59

3 Answers

On Windows you can use FindFirstFile() or FindFirstFileEx() followed by FindNextFile() to read the contents of a directory with Unicode in the returned file names.
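
A minimal sketch of that loop, assuming the wide (`W`) variants so that names arrive as UTF-16. The helper name `list_names_wide` is hypothetical, and the `#ifdef _WIN32` guard is there only so the snippet still compiles on other platforms, where it simply returns an empty list.

```cpp
#include <string>
#include <vector>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: returns the UTF-16 names of all entries in `dir`.
// An empty result means the directory could not be opened (or, outside
// Windows, that this sketch has nothing to call).
std::vector<std::wstring> list_names_wide(const std::wstring& dir)
{
    std::vector<std::wstring> names;
#ifdef _WIN32
    WIN32_FIND_DATAW data;
    // FindFirstFileW expects a search pattern, so append a wildcard.
    HANDLE find = FindFirstFileW((dir + L"\\*").c_str(), &data);
    if (find == INVALID_HANDLE_VALUE)
        return names;
    do {
        names.emplace_back(data.cFileName); // long name, not the 8.3 alias
    } while (FindNextFileW(find, &data));
    FindClose(find);
#else
    (void)dir; // nothing to enumerate with on this platform
#endif
    return names;
}
```

The result includes the `.` and `..` entries, so callers will probably want to skip those.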

janm
  • Here's what I did: `HANDLE hFind; WIN32_FIND_DATAA data; string localpath = all_paths[i] + "*"; hFind = FindFirstFileA(localpath.c_str(), &data); if (hFind != INVALID_HANDLE_VALUE) { do { printf("%s\n", data.cFileName); string testy = data.cFileName; } while (FindNextFileA(hFind, &data)); FindClose(hFind); }` – Impact Nov 04 '20 at 16:27

Short File Name

The name you receive is the 8.3 short file name that NTFS generates alongside a long file name, so that the file can still be accessed by programs that don't support long or Unicode names.
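
If you want to notice when this happens, a rough check for the typical shape of a generated alias can help. This is only a heuristic sketch with a hypothetical function name: it looks for a base of at most 8 characters ending in `~` plus digits and an extension of at most 3 characters, so a file that really is called `a~1.txt` would be flagged too.

```cpp
#include <cctype>
#include <string>

// Heuristic only: true when `name` has the shape of a generated 8.3 alias,
// i.e. a base of at most 8 characters ending in '~' plus digits, and an
// optional extension of at most 3 characters.
bool looks_like_short_alias(const std::string& name)
{
    const std::size_t dot = name.rfind('.');
    const std::string base = name.substr(0, dot);
    const std::string ext =
        (dot == std::string::npos) ? std::string() : name.substr(dot + 1);
    if (base.empty() || base.size() > 8 || ext.size() > 3)
        return false;

    const std::size_t tilde = base.rfind('~');
    if (tilde == std::string::npos || tilde == 0 || tilde + 1 == base.size())
        return false;

    // Everything after the tilde must be a digit.
    for (std::size_t i = tilde + 1; i < base.size(); ++i)
        if (!std::isdigit(static_cast<unsigned char>(base[i])))
            return false;
    return true;
}
```

With the names from the question, `GUODU0~6.TXT` matches the pattern and `grosse blessure au cœur.txt` does not.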

clinging to dirent

If dirent doesn't support UTF-16, your best bet may be to change your library.

However, depending on the implementation of the library you may have luck with:

  • adding / changing the manifest of your application to support UTF-8 in char-based Windows APIs. This requires a very recent version of Windows 10 (version 1903 or later).
    see MSDN: Use the UTF-8 code page under Windows - Apps - UWP - Design and UI - Usability - Globalization and localization.

  • setting the C++ Runtime's code page to UTF-8 using setlocale

I do not recommend this, and I don't know if this will work.

life is change

Use std::filesystem to enumerate directory content. A simple example can be found here (see the "Update 2017").
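
As a rough illustration, assuming a C++17 (or later) compiler, the enumeration could look like the sketch below. `list_names_utf8` is a hypothetical helper, not something from the linked example.

```cpp
#include <filesystem>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper: returns the names of all entries in `dir` as UTF-8.
// directory_iterator throws fs::filesystem_error if `dir` cannot be opened.
std::vector<std::string> list_names_utf8(const fs::path& dir)
{
    std::vector<std::string> names;
    for (const fs::directory_entry& entry : fs::directory_iterator(dir)) {
        // u8string() yields std::string in C++17 and std::u8string in C++20;
        // copying through iterators works with either.
        const auto utf8 = entry.path().filename().u8string();
        names.emplace_back(utf8.begin(), utf8.end());
    }
    return names;
}
```

On Windows the paths are held as wide strings internally, so the long name should come back rather than the 8.3 alias. Printing UTF-8 to the console correctly is a separate concern.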

Windows only

You can use FindFirstFileW and FindNextFileW as platform APIs that support UTF-16 strings. However, with std::filesystem there's little reason to do so (at least for your use case).

peterchen
  • Thanks for this complete explanation. I already went through the thread you linked; filesystem did not work in my case. People suggest using experimental::filesystem, and it still doesn't work now in 2020. I wasn't able to get it working. – Impact Nov 04 '20 at 16:33
  • Are you using Visual Studio? Which version? – peterchen Nov 04 '20 at 16:46
  • Yes I do, but anyway I do not need the `W`; the `A` did the job. I can now read any type of character in the file name: `Test - 160 - Testament - Ton cœur est ici éééé ààà !!!!! ö ö à &_-.mkv` stays the same in my string variable. – Impact Nov 04 '20 at 17:01

If you're in C, use the OS functions directly, specifically FindFirstFileW and FindNextFileW. Note the W at the end: you want the wide versions of these functions to get back the full non-ASCII name.

In C++ you have more options, specifically with Boost. You have classes like recursive_directory_iterator which allow cross-platform file searching, and they provide UTF-8/UTF-16 file names.
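
If you get UTF-16 back but would rather keep working with `std::string`, the name can be converted to UTF-8. The sketch below is a hand-written, portable conversion with a hypothetical function name; it takes a `std::u16string`, on the assumption that the wide strings returned on Windows hold the same UTF-16 code units. On Windows itself, `WideCharToMultiByte` with `CP_UTF8` does this job.

```cpp
#include <string>

// Converts UTF-16 to UTF-8. Unpaired surrogates are replaced by U+FFFD.
std::string utf16_to_utf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()
            && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a surrogate pair into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD; // unpaired surrogate
        }

        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The 'œ' from the question is U+0153, which falls in the two-byte branch and becomes the bytes 0xC5 0x93.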

Edit: Just to be absolutely clear, the file name you get back from your original code is correct. Due to backwards compatibility in Windows filesystems (FAT32 and NTFS), every file has two names: the "full", Unicode aware name, and the "old" 8.3 name from DOS days.

You can absolutely use the 8.3 name if you want, just don't show it to your users or they'll be (correctly) confused. Or just use the proper, modern API to get the real name.

Blindy
  • Thanks, but using the `W` version does not let me cast the result into a string; it forces me to use wstring. The problem is: wstring does not contain native functions like `find` or `replace`, which is not really helpful in my case. EDIT: I found a fix: strID.find(L"LABS"); (my bad) – Impact Nov 04 '20 at 16:31
  • Yes it does, what are you talking about? You can't use `string` anyway, because the file name you mentioned doesn't have an ASCII representation in the first place; you need a UTF representation. – Blindy Nov 04 '20 at 16:32
  • @Blindy The "A" version isn't "ASCII", it's "ANSI" and Windows code pages will be used for translating values between 128 and 255. – janm Nov 05 '20 at 08:56