NEW EDIT: Basically I've provided an example that isn't correct. In my real application the string will of course not always be "C:/Users/Familjen-Styren/Documents/V\u00E5gformer/20140104-0002/text.txt". Instead I will have an input window in Java, "escape" the Unicode characters to universal character names, and then "unescape" them again in C (I do this to avoid problems with passing multibyte characters from Java to C). So here is an example where I actually ask the user to input a string (a filename):

#include <stdio.h>
#include <string.h>

int func(const char *fname);

int main()
{
   char src[100];
   scanf("%99s", src); /* src decays to char* already; the width limit avoids overflowing the buffer */
   printf("%s\n", src);
   int exists = func(src);
   printf("Does the file exist? %d\n", exists);
   return exists;
}

int func(const char *fname)
{
    FILE *file;
    if ((file = fopen(fname, "r")) != NULL)
    {
        fclose(file);
        return 1;
    }
    return 0;
}

And now it treats the universal character names as literal text that is part of the actual filename. So how do I "unescape" the universal character names included in the input?
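For illustration, here is a sketch of such an "unescape" pass (my own helper, not from any library): it scans the input for `\uXXXX` sequences (backslash, 'u', four hex digits) and replaces each with the UTF-8 bytes for that code point, copying everything else through unchanged.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: replace every \uXXXX sequence in `src` with the
   UTF-8 encoding of that code point, copying all other bytes as-is.
   `dst` must have room for strlen(src)+1 bytes (the output is never
   longer than the input). Returns dst. Only handles code points up to
   U+FFFF, which is all that \uXXXX can express. */
char *unescape_ucn(char *dst, const char *src)
{
    char *out = dst;
    while (*src) {
        unsigned cp;
        if (src[0] == '\\' && src[1] == 'u' &&
            strspn(src + 2, "0123456789abcdefABCDEF") >= 4 &&
            sscanf(src + 2, "%4x", &cp) == 1) {
            if (cp < 0x80) {              /* 1-byte UTF-8 sequence */
                *out++ = (char)cp;
            } else if (cp < 0x800) {      /* 2-byte UTF-8 sequence */
                *out++ = (char)(0xC0 | (cp >> 6));
                *out++ = (char)(0x80 | (cp & 0x3F));
            } else {                      /* 3-byte UTF-8 sequence */
                *out++ = (char)(0xE0 | (cp >> 12));
                *out++ = (char)(0x80 | ((cp >> 6) & 0x3F));
                *out++ = (char)(0x80 | (cp & 0x3F));
            }
            src += 6;                     /* skip past \uXXXX */
        } else {
            *out++ = *src++;
        }
    }
    *out = '\0';
    return dst;
}
```

With this, `unescape_ucn(buf, src)` after the `scanf` call would turn the six input characters `\u00E5` into the two UTF-8 bytes `c3 a5` before the name reaches `fopen` (assuming the filesystem expects UTF-8 names).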

FIRST EDIT: So I compile this example like this: "gcc -std=c99 read.c", where 'read.c' is my source file. I need the -std=c99 parameter because I'm using the prefix '\u' for my universal character names. If I change it to '\x' it works fine, and I can remove the -std=c99 parameter. But in my real application the input will not use the prefix '\x'; it will use the prefix '\u'. So how do I work around this?

This code gives the desired result but for my real application I can't really use '\x':

#include <stdio.h>
#include <string.h>

int func(const char *fname);

int main()
{
   char *src = "C:/Users/Familjen-Styren/Documents/V\x00E5gformer/20140104-0002/text.txt";
   int exists = func((const char*) src);
   printf("Does the file exist? %d\n", exists);
   return exists;
}

int func(const char *fname)
{
    FILE *file;
    if ((file = fopen(fname, "r")) != NULL)
    {
        fclose(file);
        return 1;
    }
    return 0;
}

ORIGINAL: I've found a few examples of how to do this in other programming languages like JavaScript, but I couldn't find any example of how to do this in C. Here is a sample that reproduces the error:

#include <stdio.h>
#include <string.h>

int func(const char *fname);

int main()
{
   char *src = "C:/Users/Familjen-Styren/Documents/V\u00E5gformer/20140104-0002/text.txt";
   int len = strlen(src); /* This returns 68. */
   char fname[len];
   sprintf(fname,"%s", src);
   int exists = func((const char*) src);
   printf("%s\n", fname);
   printf("Does the file exist? %d\n", exists); /* Outputs 'Does the file exist? 0' which means it doesn't exist. */
   return exists;
}

int func(const char *fname)
{
    FILE *file;
    if ((file = fopen(fname, "r")) != NULL)
    {
        fclose(file);
        return 1;
    }
    return 0;
}

If I instead use the same string without universal character names:

#include <stdio.h>
#include <string.h>

int func(const char *fname);

int main()
{
   char *src = "C:/Users/Familjen-Styren/Documents/Vågformer/20140104-0002/text.txt";
   int exists = func((const char*) src);
   printf("Does the file exist? %d\n", exists); /* Outputs 'Does the file exist? 1' which means it does exist. */
   return exists;
}

int func(const char *fname)
{
    FILE *file;
    if ((file = fopen(fname, "r")) != NULL)
    {
        fclose(file);
        return 1;
    }
    return 0;
}

it will output 'Does the file exist? 1', which means it does indeed exist. But the problem is I need to be able to handle universal character names. So how do I unescape a string which contains universal character names?

Thanks in advance.

  • What happens if you just use: `char *src = "C:/Users/Familjen-Styren/Documents/V\u00E5gformer/20140104-0002/text.txt";` without the sprintf?.. sprintf will not do encoding conversion and so may just drop characters. Also what encoding are you currently using? – odedsh Mar 02 '14 at 17:34
  • @odedsh It still outputs that the file doesn't exist. I just wish it was that easy.. – Linus Mar 02 '14 at 17:47
  • Which compiler are you using? – odedsh Mar 02 '14 at 17:53
  • @odedsh I'm using MinGw 4.8.0 and I'm compiling it like this: "gcc -std=c99 read.c" Where read.c is my source file. And I'm not sure what encoding I'm using, how do I check? – Linus Mar 02 '14 at 18:03
  • It is not clear what your real application expects to see in the input. Is it a sequence of ASCII characters `'\', 'u', '0', '0', 'E', '5'`? Or is it the Unicode character `'å'`? – n. m. could be an AI Mar 02 '14 at 18:25
  • @n.m. Well, if you look at the working example without the universal character name, it uses a basic string with the special character 'å' (which is included in extended ASCII), but in my real application the character would be in ASCII form (as a universal character name) and the 'å' would look like '\u00E5'. – Linus Mar 02 '14 at 18:30
  • In memory `\u00E5` should come out as 0xC3 0xA5, which is UTF-8 for that code point – odedsh Mar 02 '14 at 18:35
  • The example is irrelevant. It does not input anything, it uses a hardcoded string. So let me reiterate. Do you expect your users to type `\u00E5` (press six keys) on the keyboard? or do you expect a sequence of these 6 characters in the input stream? If this is true, then you are trying to solve a wrong problem. – n. m. could be an AI Mar 02 '14 at 18:36
  • @n.m. No, I think you misunderstood me. In my real application I'm actually using JNI and then passing a string into C, but before I do so, I want to use pure ASCII. So I convert all the unicode characters into a universal character name and then I want to convert them back into their corresponding unicode character again, I hope you understand what I mean. – Linus Mar 02 '14 at 18:42
  • OK so if your real application will *input* a string, make your test application *input* a similar a string and not use a hardcoded one. Alternatively, simulate the input string by having your hardcoded string contain the same six characters `'\', 'u', '0', '0', 'E', '5'` you expect to see in the input. Note, for your hardcoded string to contain a `'\'` character your **source code** must contain `'\\'`. – n. m. could be an AI Mar 02 '14 at 18:48
  • 1
    What you have done so far bears no relationship to the problem you are trying to solve. Your **source code** contains the universal character name, but the actual string in the actual program does not. The compiler has translated it to something else. If you want to translate universal character names at run time, you have to write code to do so. – n. m. could be an AI Mar 02 '14 at 18:53
  • @n.m. Well, that's really what I'm trying to do, but my example wasn't all that great. My question is really how would I convert the universal character names in a string to their corresponding unicode string. For example if I input 'C:/Users/Familjen-Styren/Documents/V\u00E5gformer/20140104-0002/20140104-0002_1.mat' it will ignore the universal character name? – Linus Mar 02 '14 at 18:58
  • 2
    The example isn't "not great". It's worse than nothing. It has prompted several people to try and solve a totally wrong problem for you. "how would I convert the universal character names in a string to their corresponding unicode string" --- you have to detect the `\` symbol, verify it is followed by `u` and then by 4 hexadecimal digits, and then convert these digits to a Unicode character in your encoding of choice (most likely UTF-8). If you are not ready to tackle this, consider not translating characters to their universal names in the first place, as it makes no sense whatsoever. – n. m. could be an AI Mar 02 '14 at 19:10
  • @n.m. I'm sorry, I'm rather new to the 'C' language and I thought it wouldn't matter whether it was hardcoded or not. I just wanted to simplify my problem as much as I could. I guess I failed quite badly. But the reason I do it is because I want to avoid the problems with passing multibyte characters from Java to C (I've struggled with that problem for a long time already), so I thought it would be easier for me to use pure ASCII and pass that string to C. I added a new EDIT label which describes my problem better... – Linus Mar 02 '14 at 19:18
  • So now the question is the same as : http://stackoverflow.com/questions/241148/simplest-way-to-convert-unicode-codepoint-into-utf-8. You have the codepoints and you need to convert it to the correct multibyte utf-8 representation. (Isn't it easier to pass UTF-8 to begin with?) – odedsh Mar 02 '14 at 21:34
  • @odedsh Well, not really. I've already asked that [question](http://stackoverflow.com/questions/22054617/java-jni-passing-multibyte-characters-from-java-to-c) before and I didn't get it working, and I found a c library which can convert the universal character names back into utf-8. And I'm in the middle of learning it. – Linus Mar 02 '14 at 21:44
  • What they are saying in the question you asked is the following: 1) the Windows printf implementation is not printing UTF-8 encoded characters correctly to the console. A Windows issue, not related. To see the character, convert to wide chars and use wprintf 2) Make sure you are passing UTF-8 encoded strings and not UTF-16LE encoded – odedsh Mar 02 '14 at 22:19

2 Answers


Wrong array size: the code forgets the ".txt", the terminating \0, and that an encoded non-ASCII character takes up more than 1 byte.

// length of the string without the universal character name. 
// C:/Users/Familjen-Styren/Documents/Vågformer/20140104-0002/text
// 123456789012345678901234567890123456789012345678901234567890123
//          1         2         3         4         5         6
// int len = 63;

// C:/Users/Familjen-Styren/Documents/Vågformer/20140104-0002/text.txt
int len = 100;


char *src = "C:/Users/Familjen-Styren/Documents/V\u00E5gformer/20140104-0002/text.txt";
char fname[len];
// or if you can use VLA
char fname[strlen(src)+1];

sprintf(fname, "%s", src);
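The sizing rule this answer applies can be sketched as a small helper (my own illustration, not from the post): allocate strlen(src) + 1 bytes, since strlen counts bytes (so the two-byte UTF-8 'å' is counted correctly) and the terminator needs one more.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy `src` into a freshly allocated buffer of
   exactly the right size. strlen() counts bytes, not characters, so a
   multibyte UTF-8 character contributes all of its bytes to the count;
   the +1 reserves room for the '\0' terminator. Caller frees. */
char *copy_path(const char *src)
{
    char *fname = malloc(strlen(src) + 1);
    if (fname != NULL)
        strcpy(fname, src);  /* fits by construction */
    return fname;
}
```

This sidesteps the off-by-one entirely: the buffer is sized from the string itself rather than from a hand-counted length.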

I'm re-editing the answer in the hope of making it clearer. First of all, I'm assuming you are familiar with this: http://www.joelonsoftware.com/articles/Unicode.html. It is required background knowledge when dealing with character encodings.

Now I'm starting with a simple test program I typed on my linux machine test.c

#include <stdio.h>
#include <string.h>
#include <wchar.h>
#define BUF_SZ 255
void test_fwrite_universal(const char *fname)
{
    printf("test_fwrite_universal on %s\n", fname);
    printf("In memory we have %zu bytes: ", strlen(fname)); /* %zu for size_t */
    for (unsigned i=0; i<strlen(fname); ++i) {
        printf("%x ", (unsigned char)fname[i]);
    }
    printf("\n");
    
    FILE* file = fopen(fname, "w");
    if (file) {
        fwrite((const void*)fname, 1, strlen(fname),  file);        
        fclose(file);
        file = NULL;
        printf("Wrote to file successfully\n");
    }
}

int main()
{
    test_fwrite_universal("file_\u00e5.txt");
    test_fwrite_universal("file_å.txt");   
    test_fwrite_universal("file_\u0436.txt");   
    return 0;
}

The source file is encoded as UTF-8, and on my Linux machine my locale is en_US.UTF-8, so I compile and run the program like this:

gcc -std=c99 test.c -fexec-charset=UTF-8 -o test

./test

test_fwrite_universal on file_å.txt
In memory we have 11 bytes: 66 69 6c 65 5f c3 a5 2e 74 78 74 
Wrote to file successfully
test_fwrite_universal on file_å.txt
In memory we have 11 bytes: 66 69 6c 65 5f c3 a5 2e 74 78 74 
Wrote to file successfully
test_fwrite_universal on file_ж.txt
In memory we have 11 bytes: 66 69 6c 65 5f d0 b6 2e 74 78 74 
Wrote to file successfully

The source file is in UTF-8, my locale works off of UTF-8, and the execution character set for char is UTF-8. In main I call the function three times with character strings. The function prints each string byte by byte, then creates a file with that name and writes the string into it.

We can see that "file_\u00e5.txt" and "file_å.txt" are the same: 66 69 6c 65 5f c3 a5 2e 74 78 74, and sure enough (http://www.fileformat.info/info/unicode/char/e5/index.htm) the UTF-8 representation for code point U+00E5 is c3 a5. In the last example I used \u0436, which is the Russian character ж (UTF-8: d0 b6).
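The code-point-to-UTF-8 mapping referred to here can be sketched as a minimal encoder (my own helper, covering only code points below U+10000, which is enough for \uXXXX names):

```c
#include <stddef.h>

/* Sketch: write the UTF-8 encoding of code point `cp` into `out`
   (which must hold at least 3 bytes) and return the byte count.
   Assumes cp < 0x10000; code points above that need a 4-byte form. */
size_t utf8_encode(unsigned cp, unsigned char out[3])
{
    if (cp < 0x80) {                      /* ASCII: 1 byte, as-is */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}
```

Feeding it 0x00E5 yields the bytes c3 a5 and feeding it 0x0436 yields d0 b6, matching the dumps above.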

Now let's try the same on my Windows machine. Here I use MinGW and execute the same code:

C:\test>gcc -std=c99 test.c -fexec-charset=UTF-8 -o test.exe

C:\test>test

test_fwrite_universal on file_å.txt
In memory we have 11 bytes: 66 69 6c 65 5f c3 a5 2e 74 78 74
Wrote to file successfully
test_fwrite_universal on file_å.txt
In memory we have 11 bytes: 66 69 6c 65 5f c3 a5 2e 74 78 74
Wrote to file successfully
test_fwrite_universal on file_╨╢.txt
In memory we have 11 bytes: 66 69 6c 65 5f d0 b6 2e 74 78 74
Wrote to file successfully

So it looks like something went horribly wrong: printf is not writing the characters properly, and the file names on disk also look wrong. Two things are worth noting: in terms of byte values the file name is the same on both Linux and Windows, and the content of the file is also correct when opened with something like Notepad++.

The reason for the problem is the C standard library on Windows and the locale. Where on Linux the system locale is UTF-8, on Windows my default locale is CP-437. When I call functions such as printf or fopen, the library assumes the input is in CP-437, and there c3 a5 is actually two separate characters.

Before we look at a proper Windows solution, let's try to explain why you get different results for file_å.txt vs file_\u00e5.txt. I believe the key is the encoding of your source file. If I re-encode the same test.c in CP-437:

C:\test>iconv -f UTF-8 -t cp437 test.c > test_lcl.c

C:\test>gcc -std=c99 test_lcl.c -fexec-charset=UTF-8 -o test_lcl.exe

C:\test>test_lcl

test_fwrite_universal on file_å.txt
In memory we have 11 bytes: 66 69 6c 65 5f c3 a5 2e 74 78 74
Wrote to file successfully
test_fwrite_universal on file_å.txt
In memory we have 10 bytes: 66 69 6c 65 5f 86 2e 74 78 74
Wrote to file successfully
test_fwrite_universal on file_╨╢.txt
In memory we have 11 bytes: 66 69 6c 65 5f d0 b6 2e 74 78 74
Wrote to file successfully

I now get a difference between file_å and file_\u00e5. The character å in the source file is now actually encoded as 0x86. Notice that this time the second string is 10 bytes long, not 11. If we look at the file and tell Notepad++ to use UTF-8, we see a funny result; the same goes for the actual data written to the file.

Finally, how to get the thing working on Windows. Unfortunately, it seems to be impossible to use the standard library with UTF-8 encoded strings: on Windows you can't set the C locale to UTF-8. See: What is the Windows equivalent for en_US.UTF-8 locale?.

However we can work around this with wide characters:

#include <stdio.h>
#include <string.h>
#include <windows.h>
#define BUF_SZ 255
void test_fopen_windows(const char *fname)
{
    wchar_t buf[BUF_SZ] = {0};
    int sz = MultiByteToWideChar(CP_UTF8, 0, fname, (int)strlen(fname), buf, BUF_SZ-1);
    wprintf(L"converted %d characters\n", sz);
    wprintf(L"Converting to wide characters %ls\n", buf); /* %ls: wide string, portable */
    FILE* file =_wfopen(buf, L"w");
    if (file) {
        fwrite((const void*)fname, 1, strlen(fname),  file);        
        fclose(file);
        wprintf(L"Wrote file %ls successfully\n", buf);
    }
}


int main()
{
    test_fopen_windows("file_\u00e5.txt");
    return 0;
}

To compile use:

gcc -std=gnu99 -fexec-charset=UTF-8 test_wide.c -o test_wide.exe

_wfopen is not ANSI compliant, and -std=c99 defines __STRICT_ANSI__, so you should use -std=gnu99 to have that function available.

  • Okay, but actually fopen() did seem to work when I tried a hardcoded UTF-8 string. And if you wish to answer the question on how to pass a UTF-8 string via JNI, I have asked the question [here](http://stackoverflow.com/questions/22054617/java-jni-passing-multibyte-characters-from-java-to-c). This answer will be accepted whatsoever because you've helped me quite a lot already, but I'm afraid I will be stuck on this problem for a while.. Thank you. – Linus Mar 02 '14 at 22:17
  • The only reason I can think of that the literal 'Vågformer' would work is if you are in Windows ANSI codepage 437 and å is actually 0x86, not UTF-8 at all – odedsh Mar 03 '14 at 05:21
  • @Linus I rewrote the answer with some more information I hope this time it is useful. – odedsh Mar 03 '14 at 10:26
  • This was some very useful information, but what would happen if you use another charset like ISO/IEC 8859-1 or the windows default charset: Windows-1252? – Linus Mar 03 '14 at 14:40
  • Then you need to encode the string in that charset and you will get different results. In Code Page 1252 the code for 0x86 is a small dagger / cross. And your character is 0xE5. However some possible characters will be missing obviously – odedsh Mar 03 '14 at 15:20
  • OK, thanks. By the way, what do you mean by: " If I write the same test.c in CP-437" ? – Linus Mar 03 '14 at 15:25
  • on windows _wfopen seems to be the only way to get a truly global reach with file names in every possible language – odedsh Mar 03 '14 at 15:25
  • Well, that doesn't apply to external I/O libraries, does it? I'm using something called MatIO and it doesn't accept wide-character strings. And I'm not planning to rewrite the source code for the library. Alternatively, I might contact the owner about my problem. – Linus Mar 03 '14 at 15:31
  • Alright, I've looked at the source code and it seems to not tolerate wide character strings after all. But it also looks like it is very easy to implement, because it is using the ordinary fopen function, and therefore I should be able to change it to be able to open wide character file paths. Any thoughts on this? – Linus Mar 03 '14 at 16:25
  • Seems like your options are: 1) When possible convert the UTF-8 stream to a codepage that supports all characters you have in the file name. setlocale to that same codepage.. It will take some tweaking 2) Make sure file names are in plain ASCII 3) Contact the library owner to provide an open function that accepts a FILE 4) make your own change in Mat_Open; it seems to be a very local change in mat.c, however that is a headache when they update the library / someone else takes over your code – odedsh Mar 03 '14 at 16:26
  • Yeah okay, I think I'm going to try option 4, and perhaps option 3. But I'm not sure he will answer. There might be a patch for it, who knows. – Linus Mar 03 '14 at 16:30
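A rough sketch of option 4's local change (the name `fopen_utf8` and the non-Windows fallback are my assumptions, not MatIO's API): convert the UTF-8 name to UTF-16 and call `_wfopen` on Windows, and fall back to plain `fopen` elsewhere, where the C library accepts UTF-8 names directly.

```c
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical drop-in replacement for fopen inside the library:
   `fname` and `mode` are UTF-8. On Windows, convert both to UTF-16
   and use the wide-character _wfopen; on other systems, pass the
   UTF-8 name straight to fopen. Returns NULL on failure, like fopen. */
FILE *fopen_utf8(const char *fname, const char *mode)
{
#ifdef _WIN32
    wchar_t wname[MAX_PATH], wmode[8];
    /* -1 tells MultiByteToWideChar the input is NUL-terminated, so the
       output is NUL-terminated too. */
    if (!MultiByteToWideChar(CP_UTF8, 0, fname, -1, wname, MAX_PATH))
        return NULL;
    if (!MultiByteToWideChar(CP_UTF8, 0, mode, -1, wmode, 8))
        return NULL;
    return _wfopen(wname, wmode);
#else
    return fopen(fname, mode);
#endif
}
```

Swapping this in for the `fopen` call in Mat_Open would keep the library's char* interface while making non-ASCII names work on Windows, at the cost of maintaining a local patch.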