I am using a library with a function that returns result strings encoded as UTF-16LE (I'm pretty sure) in a standard char *, along with the number of bytes in the string. I would like to convert these strings to UTF-8. I tried the solution from the question "Convert UTF-16 to UTF-8 under Windows and Linux, in C", which says to use iconv, but the result was that both the input and output buffers wound up empty. What am I missing?

My input and output buffers are declared and initialized as follows:

char *resbuff=NULL;
char *outbuff=NULL;
int stringLen;
size_t outbytes=1024;
size_t inbytes;
size_t convResult;
...
//some loop and control code here
...
if (resbuff==NULL) {
    resbuff=(char *)malloc(1024);
    outbuff=(char *)malloc(1024);
}

I then call the library function to fill resbuff with data. Looking at the buffer in the debugger, I can see the data in the buffer. For example, if the data is "test", I see the following when looking at the individual indexes of resbuff:

't','\0','e','\0','s','\0','t','\0'

I believe this is UTF-16LE (other code using the same library appears to confirm this), and stringLen now equals 8. I then try to convert that to UTF-8 using the following code:

iconv_t conv;
conv=iconv_open("UTF-8", "UTF-16LE");
inbytes=stringLen;
convResult=iconv(conv,&resbuff,&inbytes,&outbuff,&outbytes); //this does return 0
iconv_close(conv);

With the result that outbuff and resbuff both end up as null strings.

Note that I declare stringLen as an int rather than an unsigned long because that is what the library function is expecting.

EDIT: I tweaked my code slightly as per John Bollinger's answer below, but it didn't change the outcome.

EDIT 2: Ultimately the output from this code will be used in Python, so I'm thinking that while it might be uglier, I'll just perform the string conversion there. It just works.

ibrewster
  • I think it's working for you already in C. You are misunderstanding the result. Most likely, you are missing the implications of `iconv()` updating the value of the output buffer pointer. – John Bollinger Nov 18 '14 at 18:14
  • @JohnBollinger So perhaps the original output buffer has the expected output, it's just that the pointer is no longer pointing to the original? – ibrewster Nov 18 '14 at 19:21
  • Yes, exactly so. `iconv()` should leave it pointing at the position in the buffer immediately following the converted data. That's why, as I included in my updated answer, you need to pass a pointer to a _copy_ of your output buffer pointer if you in fact need to retain the value of the original pointer (which is not always needed, but which you probably _do_ need). – John Bollinger Nov 19 '14 at 15:22
  • @JohnBollinger I ran some tests on this, and it does appear that this is the answer. The initial outbuff I declared does end up holding the properly converted string - I was just losing the pointer to said buffer. – ibrewster Nov 19 '14 at 21:39

1 Answer

You do not show the declaration or initialization of variables stringLen and outbytes, and your problem might well lie there. However, this ...

Note that I declare stringlen as an int rather than an unsigned long because that is what the library function is expecting.

... is very troubling. The iconv() function expects its third and fifth arguments to be of type size_t *, and lying to the compiler via a cast isn't going to make the code actually work if they are in fact different types. You should have something along these lines:

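size_t in_bytes_left = (size_t) stringLen;  /* total input length, in bytes */
size_t out_bytes_available = 1024;          /* size of the output buffer, as allocated in the question */
char *input_temp = resbuff;                 /* working copies: iconv() advances these pointers */
char *output_temp = outbuff;
size_t result;                              /* iconv() returns size_t, not int */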

result = iconv(conv, &input_temp, &in_bytes_left, &output_temp, &out_bytes_available);

Note, too, that you should check the return value to make sure the conversion was complete and successful. iconv() returns a size_t, so it never reports failure as a negative number; on error it returns (size_t) -1, and the value of errno immediately after the call will tell you what kind of problem occurred.
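Putting the pieces together, a minimal sketch of a complete conversion might look like the following. The helper name utf16le_to_utf8 and its parameters are made up for illustration and are not part of iconv or of the library in the question; the sketch assumes the whole input fits in one call and that the caller supplies a large enough output buffer.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper (name and signature are assumptions, not a library API).
 * Converts in_len bytes of UTF-16LE at `in` into UTF-8 at `out` (capacity out_cap).
 * Returns the number of UTF-8 bytes written, or (size_t)-1 on error with errno set. */
size_t utf16le_to_utf8(const char *in, size_t in_len, char *out, size_t out_cap)
{
    iconv_t conv = iconv_open("UTF-8", "UTF-16LE");
    if (conv == (iconv_t)-1)
        return (size_t)-1;

    /* Working copies: iconv() advances these, so the caller's pointers stay intact. */
    char *in_ptr = (char *)in;   /* cast needed because iconv() takes char ** */
    char *out_ptr = out;
    size_t in_left = in_len;
    size_t out_left = out_cap;

    size_t rc = iconv(conv, &in_ptr, &in_left, &out_ptr, &out_left);
    int saved_errno = errno;     /* iconv_close() may change errno */
    iconv_close(conv);

    if (rc == (size_t)-1) {
        errno = saved_errno;
        return (size_t)-1;
    }
    /* Bytes written = capacity minus what iconv() reports as still free. */
    return out_cap - out_left;
}
```

The caller would pass resbuff, (size_t) stringLen, outbuff and the output capacity, and keep using outbuff afterwards, since only the local copies were advanced. The output is not NUL-terminated; append a terminator if you need a C string.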

Edited to add:

You originally said that zero bytes were converted, but you now say that

outbuff and resbuff both end up as null strings.

which is not the same thing at all.

The iconv() function updates the pointers to the input and output buffers to facilitate converting a long input via multiple calls, the need for that being fairly common. That's why you must pass pointers to those pointers. If you don't want to lose the original values of these pointers then you should make and pass copies; I have updated my code above to demonstrate this.
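To illustrate why the pointers are updated, here is a rough sketch that feeds one input buffer to iconv() in several calls. The function convert_chunked and its chunk parameter are hypothetical, chosen only for this illustration; it assumes chunk is at least 4 bytes, so that a UTF-16 surrogate pair always fits in a single call.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: converts UTF-16LE to UTF-8, handing iconv() at most
 * `chunk` input bytes per call (assumes chunk >= 4).
 * Returns the number of UTF-8 bytes written, or (size_t)-1 on failure. */
size_t convert_chunked(const char *in, size_t in_len, size_t chunk,
                       char *out, size_t out_cap)
{
    iconv_t conv = iconv_open("UTF-8", "UTF-16LE");
    if (conv == (iconv_t)-1)
        return (size_t)-1;

    char *in_ptr = (char *)in;          /* advanced by every iconv() call */
    char *out_ptr = out;                /* likewise: next free output position */
    size_t out_left = out_cap;
    const char *in_end = in + in_len;
    int failed = 0;

    while (in_ptr < in_end) {
        size_t remaining = (size_t)(in_end - in_ptr);
        size_t in_left = remaining < chunk ? remaining : chunk;
        char *before = in_ptr;

        size_t rc = iconv(conv, &in_ptr, &in_left, &out_ptr, &out_left);

        /* EINVAL means the chunk ended inside a character; that is fine as long
         * as some input was consumed, because the next call resumes at in_ptr.
         * Anything else (EILSEQ, E2BIG, or no progress) is treated as failure. */
        if (rc == (size_t)-1 && (errno != EINVAL || in_ptr == before)) {
            failed = 1;
            break;
        }
    }

    iconv_close(conv);
    return failed ? (size_t)-1 : (size_t)(out_ptr - out);
}
```

Each call picks up where the previous one stopped, which only works because iconv() writes the new positions back through the pointers it is given. The original out pointer is untouched, so the distance out_ptr - out gives the converted length.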

Additionally, iconv() returns either an error indicator or a count of irreversibly-converted characters, not a count of the total number of converted characters. For valid UTF-16{,LE,BE} to UTF-8, there should never be any irreversible conversions. A return value of zero indicates that the specified number of input bytes were all successfully and reversibly converted to output bytes.

Note also that resbuff, at least, never was a C string. The null chars embedded in the data make a string interpretation inappropriate. Depending on how your input and output buffers were initialized, however, it could be that after iconv() finishes, *resbuff == '\0' and *outbuff == '\0' (referring to your own current code). I'd call those "empty" strings, by the way, not "null" strings. If you do really mean that iconv() leaves resbuff == 0 and outbuff == 0 (i.e. NULL pointers) then that would constitute a bug in iconv().
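As a small illustration of why a string interpretation misleads here, the sketch below applies strlen() to the "test" bytes from the question. The names utf16le_test and c_string_view_length are invented for this example.

```c
#include <stddef.h>
#include <string.h>

/* "test" encoded as UTF-16LE: 8 bytes, every second byte is zero. */
static const char utf16le_test[8] = {'t', 0, 'e', 0, 's', 0, 't', 0};

/* Length as the C string functions see it: they stop at the first zero byte. */
size_t c_string_view_length(void)
{
    return strlen(utf16le_test);
}
```

Here strlen() stops at the zero byte in position 1, so it reports a length of 1 even though the buffer holds 8 bytes of data. A debugger or printf("%s") applies the same rule, which is why the byte count returned by the library has to be used instead.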

John Bollinger
  • I do show the declaration of stringlen, I just mistyped the capital L when putting in my code here. I did, however, neglect to show the declaration of outbytes. I'll edit to fix those. – ibrewster Nov 17 '14 at 20:24
  • Ok, I updated my code, and the question, as per your suggestions so I am passing a size_t for both arguments and checking the return code. I do get 0 back, so it appears to THINK the conversion is working, but it's not. – ibrewster Nov 17 '14 at 20:40
  • Yes, calling resbuff a C string is inappropriate - but there it is. The library call requires a char ** as an argument, and fills it with the string as indicated. That's what I have to work with. My apologies if my terminology is wrong, but yes: the buffers are filled with null values at the end, they do not become null pointers. And as I said, iconv does return 0, so it thinks it is converting all the characters, but the output buffer is still filled with nulls. Dunno. – ibrewster Nov 18 '14 at 16:52
  • Oh, and I never said "zero bytes were converted" All I have said is that there are zero bytes in the output buffer. You are right - those aren't the same thing. That's why I never said the first one. – ibrewster Nov 18 '14 at 16:54
  • Sorry, I had to work from memory since you edited the question, but "zero bytes in the output buffer" is equivalent to "zero bytes are converted" in this case because UTF-16 does not have any shift sequences. – John Bollinger Nov 18 '14 at 18:13