open() function in Linux with extended characters (128-255) returns -1 error

Question

When I try to create a file on Linux using the open() function, it returns -1 for a filename that contains an extended character (e.g. Björk.txt). Here the filename contains the special character ö (ASCII 148).

I am using the code below:

char *szUnixPath;   /* holds "/home/user188/Output/Björk.txt" */

open(szUnixPath, locStyle, S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);

open() always returns -1, and no file is created.

It seems that the OS fails as soon as it encounters the byte 148.

The same call works perfectly fine if I use a tilde ~ (ASCII 126, e.g. Bj~rk.txt) or any other character with a value below 128.

Can somebody explain why I get -1 only for filenames containing special characters in the range 128-255?

How do you input the name of the file? Input from the console or a GUI? Hardcoded in the source? Are you sure the encoding of the filename you pass to `open` is the same as the encoding used by the filesystem for its filenames? — Some programmer dude, Oct 20 '17 at 12:47
Try comparing the behavior of `char bjork1[] = "Bj\366rk"` and `char bjork2[] = "Bj\303\266rk"`. — Steve Summit, Oct 20 '17 at 12:47
"*character ö (ASCII 148)*" <- ö doesn't have an ASCII code. As for the question, on Linux, filenames are just byte sequences, so you need to use the same encoding that was used creating the file. If `ls` shows the name correctly, this is the encoding of your current locale, just type `locale` to find out. — , Oct 20 '17 at 12:49
This definitely should work just fine on linux. What is the actual code you have? — Art, Oct 20 '17 at 12:57
You are surely making an incorrect association between failure of `fopen()` and characters with value greater than 127 generally. There is no inherent incompatibility there, but you do need to encode all characters of the filename in the same way that they are encoded by the OS in directory entries. — John Bollinger, Oct 20 '17 at 12:57
@adam as said, this seems strange; you probably have a locale error. You can try this code on your machine and see whether you get an error with no locale set: https://wandbox.org/permlink/aPKLDoTp4VYvHhLk — Pierrot, Oct 20 '17 at 13:21
I can see the locale as below: LANG=en_US.UTF-8 LC_CTYPE="en_US.UTF-8" LC_NUMERIC="en_US.UTF-8" LC_TIME="en_US.UTF-8" LC_COLLATE="en_US.UTF-8" LC_MONETARY="en_US.UTF-8" LC_MESSAGES="en_US.UTF-8" LC_PAPER="en_US.UTF-8" LC_NAME="en_US.UTF-8" LC_ADDRESS="en_US.UTF-8" LC_TELEPHONE="en_US.UTF-8" LC_MEASUREMENT="en_US.UTF-8" LC_IDENTIFICATION="en_US.UTF-8" — adam, Oct 20 '17 at 13:33
@adam well, then the UTF-8 encoding of `ö` (2 bytes!) would work. To be absolutely sure, try [or523's code](https://stackoverflow.com/a/46849937/2371524). — , Oct 20 '17 at 13:47
@SteveSummit `char bjork2[] = "Bj\303\266rk"` is what I need. My code gives me "Bj\366rk", which is wrong. What have I missed? — adam, Oct 20 '17 at 20:45
@Pierrot This is what i get as the output: opening Björk.txt with locale en_US.UTF-8 opening Björk.txt with locale en_US.UTF8, I don't get an error — adam, Oct 20 '17 at 20:55
@adam Is the "wrong" version of the string coming from user-typed input, or something? If so, I guess you'll have to figure out how to set the locale of the input system to UTF-8. (But, yes, I see from the environment variables you posted that the locale seems to be UTF-8 already. But there may be other settings having to do with your keyboard or terminal window or something.) — Steve Summit, Oct 20 '17 at 21:10
My program extracts the file with special character from inside of a zip file and reads its characters. Then finally creates a new file with the same name — adam, Oct 20 '17 at 22:26

score 1 · Answer 1 · answered Oct 20 '17 at 13:17

I recommend checking for yourself which bytes this name contains.

Create the file in a directory, then run the following simple C program:

#include <dirent.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Open the current directory */
    DIR * currdir = opendir(".");
    if (NULL == currdir)
    {
        perror("opendir");
        return EXIT_FAILURE;
    }

    /* Iterate over the directory entries */
    struct dirent * directory_entry = NULL;
    while (NULL != (directory_entry = readdir(currdir)))
    {
        char * entry_name = directory_entry->d_name;
        size_t len = strlen(entry_name);
        printf("Directory entry: %s\n", entry_name);
        printf("Name bytes (len: %zu):\n", len);
        for (size_t i = 0; i < len; ++i)
        {
            /* Where char is signed (e.g. x86 Linux), bytes above 127 print as negative values */
            printf("\tname[%zu] = %d\n", i, entry_name[i]);
        }
    }

    closedir(currdir);
    return 0;
}

The output shows that the name 'Björk' is 6 bytes long, along with the value of each byte:

Directory entry: Björk
Name bytes (len: 6):
    name[0] = 66
    name[1] = 106
    name[2] = -61
    name[3] = -74
    name[4] = 114
    name[5] = 107

You are right. I get the same output, but I am curious why the values for the special character ö are negative (-61 and -74). — adam, Oct 20 '17 at 16:23
@adam: That is the UTF-8 encoding of ö, incorrectly printed as signed values. (The normal presentation would be hexadecimal C3 B6, which is the 2s complement representation of the integers shown in this answer.) — rici, Oct 20 '17 at 16:45
In my code I am getting the output 'Bj\224rk.txt' while using the output charset UTF8. Can somebody tell me where I am going wrong? I have already set the character encoding to UTF8 — adam, Oct 20 '17 at 17:01

score 0 · Answer 2 · answered Oct 20 '17 at 13:25

Filenames in Linux are generally specified in UTF-8, not CP437. The open is failing because the filename you're passing doesn't match the one in the OS.

Try opening this file instead: /home/user188/Output/Bj\xc3\xb6rk.txt. This is the special character encoded in UTF-8 as two bytes.

Mark Ransom

"*are generally specified in UTF-8*" <- not really, although common... they are generally just opaque byte sequences, using whatever encoding was used creating them. – Oct 20 '17 at 13:31
@felix: that depends on the filesystem, no? – rici Oct 20 '17 at 13:36
@rici some filesystems have options for converting filename encodings, but that's not generally the case. The userspace API just treats the names as byte sequences. – Oct 20 '17 at 13:38
@felix filesystem samba, with unix encoding set to utf-8. Attempt to use a filename string which is invalid utf-8. Result? – rici Oct 20 '17 at 14:16
@rici yes, smbfs/cifs is one of those that do translations. The result should be a simple `ENOENT`. – Oct 20 '17 at 14:18
@FelixPalmen but if you use `ls` what encoding is used to display the filename? The filename needs an encoding even if the filesystem doesn't care, and in Linux the encoding is extremely likely to be UTF-8. – Mark Ransom Oct 20 '17 at 14:25
@MarkRansom `ls` (probably, I didn't check the code) doesn't care and just "shovels bytes" to `stdout`. Yes, nowadays very likely to find UTF-8, but there's no guarantee and Linux supports a lot of other encodings as well. If you switch between encodings and create files, you end up with a horrible mess of differently encoded file names -- such things happened to Linux users ;) – Oct 20 '17 at 14:29
@felix: that's why i said it depends on the *filesystem*, because it is the filesystem which interprets (and possibly rejects) the filename string. And I don't believe samba/cifs is uncommon, although I no longer use it. – rici Oct 20 '17 at 14:33
@rici it's a special case because it interfaces with a windows server and on windows, filenames **do** have encoding information. A typical Linux filesystem will just not care. Btw, if configured correctly (the "unix encoding" set to the same encoding your locale is using), what smbfs does is completely transparent to the caller. – Oct 20 '17 at 14:35
@felix: Exactly the same thing would happen if you use a mounted NTFS partition on your local machine (which is something which does apply to one of the machines I use). Some Linux filesystems don't care; others do. *It depends on the filesystem.* (And SMBFS will not handle an invalid UTF-8 encoding, regardless of your locale settings. It is just less likely to happen.) – rici Oct 20 '17 at 14:43
@rici SMBFS will handle whatever encoding you configure, which is necessary because the windows server DOES use encodings in filenames. NTFS is another exception for a similar reason: it's a windows filesystem. Genuine Unix filesystems don't do such things. – Oct 21 '17 at 06:29
