
I have a project which needs to read the path of a SysData file. I want to move the SysData file to a path that contains "ç", "ş" and "ğ", but the code cannot read these characters. I have to read the path as Unicode (e.g. UTF-8).

Here is the code:

bool TSimTextFileStream::ReadLine ( mstring * str )
{
        *str = "";
        char c = ' ';
        bool first = true;
        // while ( read ( hFile, &c, 1 ) )
        while ( fread ( &c, 1, 1, hFile ) )
        {
                if (first) first = false;
        #ifdef __linux__
                if ( c == 13 )
                        continue;
                else if ( c == 10 )
                        break;
                else
                        *str += c;
        #else
                if ( c == 13 || c == 10 )
                        break;
                else
                        *str += c;
        #endif
        }
        return !first;
}

And here is the code that calls this method:

mstring GetSysDataDirectory ( )
{
    static mstring sysDataDir = "";
    if ( sysDataDir == "" )
    {
        if ( mIsEnvironmentVarExist ( "SYSDATAPATH" ) )
        {
            mstring folder = mGetEnvVar ( "SYSDATAPATH" );

            if ( folder.size() == 0 )
            {
                folder = mGetCurrentDir ( ) + "/SysData";
            }

            sysDataDir = folder;
        }
        else if ( mIsFileExist ( "SysDataPath.dat" ) )
        {
            TSimTextFileStream txtfile;
            txtfile.OpenFileForRead( "SysDataPath.dat" );
            mstring folder;
            if ( txtfile.ReadLine( &folder ) )
            {
                sysDataDir = folder;
            }
            else
            {
                sysDataDir = mGetCurrentDir ( ) + "/SysData";
            }
        }
        else
        {
            sysDataDir = mGetCurrentDir ( ) + "/SysData";
        }
    }

    return sysDataDir;
}

I searched and found some solutions, but they did not work, like this:

bool TSimTextFileStream::OpenFileForRead(mstring fname)
{
        if (hFile != NULL) CloseFile();

        hFile = fopen(fname.c_str(), "r,ccs=UNICODE");

        if (hFile == NULL) return false; else return true;
}

and I tried this:

hFile = fopen(fname.c_str(), "r,ccs=UTF-8");

But it does not work either. Can you help me, please?


This is my problem :((

  • You didn't explain what the problem is, apart from the custom file-reading code. Visual C++ has supported Unicode for at least 20 years. Most applications on Windows have used Unicode for almost that long. You *don't* need to write your own code to read text files. UTF8 can be read/written just like *any* string though; it doesn't even need the Unicode-specific (i.e. UTF16/32) functions. What is `mstring`, and why don't you use `std::string` and streams? – Panagiotis Kanavos Sep 11 '17 at 11:01
  • Have you tried creating a sample project with Visual Studio's project wizard? The sample project is a text editor with menu, styling, Unicode (UTF16) support. Unicode support is just a checkbox. You can select plain text, RTF, HTML editor views from combos. – Panagiotis Kanavos Sep 11 '17 at 11:04
  • @PanagiotisKanavos actually it is `#define mstring std::string`. – cevapsızcagri Sep 11 '17 at 11:07
  • The only problem you may encounter is that Windows works natively with UTF16LE. Strings are either UTF16 Unicode or encoded. With UTF8 strings though, you can't tell whether they are UTF8 or ASCII unless you scan the entire string for non-ASCII characters. You'll have to *convert* the UTF8 string to UTF16 before passing it to any system function. Definitely before displaying it. – Panagiotis Kanavos Sep 11 '17 at 11:09
  • Then `#undef mstring std::string`. That's an extremely bad idea. Everyone knows what a `std::string` is – Panagiotis Kanavos Sep 11 '17 at 11:10
  • @PanagiotisKanavos It is a big project and they want to fix this by changing just a few lines of code. I am a junior software engineer (first month), so can you show me the way? I am ready to search and learn. – cevapsızcagri Sep 11 '17 at 11:10
  • Then try the wizard-generated sample – Panagiotis Kanavos Sep 11 '17 at 11:11
  • Another thing to consider is that C++ got actual Unicode support and types starting with C++11. There are now `char16_t`, `char32_t` and the corresponding standard types. UTF8 though still has to use the 8-bit types. – Panagiotis Kanavos Sep 11 '17 at 11:17
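
To make the conversion the comments describe concrete, here is a minimal sketch (not from the thread) using the C++11 facilities mentioned above, assuming a Windows toolchain where wchar_t is 16-bit; Utf8ToWide is a hypothetical helper name, and <codecvt> was later deprecated in C++17 but is available in compilers of that era:

#include <codecvt>
#include <locale>
#include <string>

// Converts a UTF-8 encoded byte string to a UTF-16 wide string,
// which is what the wide Win32 functions expect.
std::wstring Utf8ToWide(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
    return converter.from_bytes(utf8);
}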

2 Answers


Windows does not support UTF-8 encoded path names for fopen:

The fopen function opens the file that is specified by filename. By default, a narrow filename string is interpreted using the ANSI codepage (CP_ACP).

Source.

Instead, a second function, called _wfopen, is provided, which accepts a wide-character string as the path argument.

Similar restrictions apply when using the C++ fstreams for File I/O.

So the only way for you to solve this is by converting your UTF-8 encoded pathname either to the system codepage or to a wide character string.
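
A minimal sketch of this approach, assuming the path read from SysDataPath.dat is UTF-8 encoded; OpenUtf8Path is a hypothetical helper, not part of the asker's code:

#include <windows.h>
#include <cstdio>
#include <string>

FILE* OpenUtf8Path(const std::string& utf8Path, const wchar_t* mode)
{
    // First call asks how many wide characters are needed (including the
    // terminating null, because -1 is passed as the input length).
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8Path.c_str(), -1, nullptr, 0);
    if (len == 0) return nullptr;

    std::wstring widePath(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8Path.c_str(), -1, &widePath[0], len);

    // _wfopen takes the wide-character path, so "ç", "ş", "ğ" survive intact.
    return _wfopen(widePath.c_str(), mode);
}

In the asker's code the path containing "ç" comes out of ReadLine as raw bytes; wherever that path is later passed to a file-opening function, it would need to go through a conversion like this and _wfopen instead of plain fopen. Note that if SysDataPath.dat is saved with a UTF-8 BOM, the BOM bytes will end up at the start of the line ReadLine returns.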

ComicSansMS
  • Worth noting that a) Windows is natively Unicode (UTF16 to be exact), which is why you can't pass a UTF8 string and, more importantly, b) VC++ uses macros that target ASCII/Unicode functions and types through the `_t` prefix. Using `_tfopen` instead of `fopen()` or `_wfopen()` would have avoided the problem in the first place (a minimal sketch follows after these comments). – Panagiotis Kanavos Sep 11 '17 at 11:14
  • @PanagiotisKanavos Good point about the `_tfopen` macro. Unfortunately these macros still stem from a time when Unicode-support meant making a binary decision between ASCII and UCS-2, so I am not sure of how much use they are in a codebase that is largely UTF-8 based. In particular I am not sure they would have helped catching this particular mistake, as UTF-8 and ASCII strings share the same types in C. – ComicSansMS Sep 11 '17 at 11:19
  • @PanagiotisKanavos can you explain a), please? I don't get it. – cevapsızcagri Sep 11 '17 at 11:20
  • And still do, even though C++ got char16_t and char32_t in C++11. Such codebases are generally in trouble when they encounter non-UTF8 text. As in all non-English text files, saved using the end user's native encoding. A case where you *can't* just force everything to be treated as UTF8 with an environment variable – Panagiotis Kanavos Sep 11 '17 at 11:23
  • @PanagiotisKanavos I have to read contents like "C:\Users\cagri.ozcan\Desktop\ç\SysData" from the sysdatapath.dat file. This file was created manually (Notepad++). – cevapsızcagri Sep 11 '17 at 11:26
  • @cevapsızcagri Windows uses 2-byte Unicode since the first NT version. Otherwise it *wouldn't* be used all over the world, especially not in Eastern Europe and Asia. Ascii functions are only for convenience and migrating older code. A string is either Unicode or it isn't. The *file system, NTFS* supports Unicode paths. Only the *IO functions* work with ASCII, again for convenience. Nowadays, and for the last 10 years I'd say, if some code uses ASCII on Windows, there's a serious bug. Unless it's the tax service where they don't care /1 – Panagiotis Kanavos Sep 11 '17 at 11:28
  • @cevapsızcagri UTF16 can be detected very easily, even when you enter ANSI characters - the first byte of every character is always null. UTF8 though is indistinguishable from some ASCII string. It may be exactly the same if it contains Latin characters, or it may contain characters that may be valid in an ASCII codepage. Which means mixing UTF8 and ASCII is error-prone. You can treat *all* strings as UTF8, but then your code will fail if you try to load a file encoded with a local codepage /2 – Panagiotis Kanavos Sep 11 '17 at 11:31
  • @cevapsızcagri (some) Linux applications get away with it by simply ignoring the local use case -just set an environment variable to UTF8. Never mind that Greek company that had to sell Turkish pillows to Bulgaria (true story). 4 codepages (3 local + 1 English) for the price of 1. /3 – Panagiotis Kanavos Sep 11 '17 at 11:33
  • @cevapsızcagri in Windows, there *should* be no ambiguity. Convert everything to Unicode when loading. You only care about the codepage when importing/exporting text. The language itself will complain if you forget to convert. Otherwise, you *have* to keep track of the codepage, always. Typically, this means that your code should accept a locale parameter. Or you can assume that your app will use `char` ASCII throughout and deal with codepages at the edges, without any support from the language or OS. – Panagiotis Kanavos Sep 11 '17 at 11:36
  • @ComicSansMS one can see the quaint UTF16/UTF8/ASCII dance performed live with R on Windows, R-Studio didn't store scripts as UTF8 by default until recently, the language and some packages are (mostly) Unicode, some packages are plain ASCII and some are ... half-and-half. The `write` function works with Unicode while `read` is ASCII, mangling the text. – Panagiotis Kanavos Sep 11 '17 at 11:40
  • @PanagiotisKanavos OMG, really in-depth information for a beginner junior. I am really confused :) but thank you for your caring and your answers. – cevapsızcagri Sep 11 '17 at 11:49
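
For the `_tfopen` macro mentioned in the first comment above, a minimal sketch (assuming a Visual C++ project; OpenSysDataPathFile is a hypothetical wrapper):

#include <tchar.h>
#include <cstdio>

// _tfopen expands to _wfopen when _UNICODE is defined and to fopen otherwise,
// so the same source line compiles for both ANSI and Unicode builds.
FILE* OpenSysDataPathFile()
{
    // _T() makes the string literal wide only in Unicode builds.
    return _tfopen(_T("SysDataPath.dat"), _T("r"));
}

As ComicSansMS notes, this only helps if the surrounding code also uses TCHAR; a UTF-8 std::string path still needs the explicit conversion shown above.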

fopen usually reads Unicode chars. Try to change the file's encoding.