String from NSInputStream is not valid UTF-8. How to convert to UTF-8 in a more 'lossy' way

Question

I have an app that reads data from a server. Now and then, the data turns out not to be valid UTF-8. When I convert the byte array to a UTF-8 string, the result is nil, so there must be some invalid, non-UTF-8 byte sequence in the byte array. Is there a way to do a 'lossy' conversion of the byte array to UTF-8 that filters out only the invalid characters?

Any ideas?

My code looks like this:

- (void)stream:(NSStream *)theStream handleEvent:(NSStreamEvent)streamEvent {

switch (streamEvent){
    case NSStreamEventHasBytesAvailable:
    {
        uint8_t buffer[1024];
        NSInteger len;  // -read:maxLength: returns NSInteger (negative on error)
        NSMutableData * inputData = [NSMutableData data];
        while ([directoryStream hasBytesAvailable]){
            len = [directoryStream read:buffer maxLength:sizeof(buffer)];
            if (len > 0) {
                [inputData appendBytes:(const void *)buffer length:len];
            }
        }
        // Convert only after all available chunks have been appended.
        NSString *directoryString = [[NSString alloc] initWithData:inputData encoding:NSUTF8StringEncoding];
        NSLog(@"directoryString: %@", directoryString);
    }

    ...

Is there a way to do this conversion in a more 'lossy' way?

As you can see, I first append the chunks of data to an NSMutableData value and convert to UTF-8 only after everything has been read. This prevents multi-byte UTF-8 characters from being split across chunks, which would result in even more invalid (nil) strings.

Maybe something like this: `NSMutableString *finalString = [[NSMutableString alloc] init]; while ([directoryStream hasBytesAvailable]){ len = [directoryStream read:buffer maxLength:sizeof(buffer)]; if (len> 0){for (int i = 0; i < len; i ++){NSData *data = [NSData dataWithBytes:&buffer[i] length:sizeof(uint8_t)];NSString *possibleString = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding]; if (possibleString)[finalString appendString:possibleString];}}}`. The idea is to test, for each byte, whether the `NSData` to `NSString` conversion is valid. — Larme, May 21 '15 at 15:02
What kind of server protocol is being used? How do you know when the stream has actually reached the end of the UTF-8 bytes? Do you know the byte count ahead of time, or is there a marker of some kind at the end of the data? You should not be converting the UTF-8 buffer to a String until you know for sure that you actually completed the full UTF-8 buffer. Waiting until `hasBytesAvailable` is false is not reliable enough if the raw data is streaming and getting delivered in pieces across multiple `NSStreamEventHasBytesAvailable` events. — Remy Lebeau, May 21 '15 at 16:54
Get an event, append the available data to your buffer, then check if the buffer has reached end-of-data before converting it to a String. Repeat as needed. — Remy Lebeau, May 21 '15 at 16:57
@Larme, that is not possible. UTF-8 characters can be built from multiple bytes, so you cannot check byte by byte whether the data is valid UTF-8. As for the other comments: I have thought of that, but I have no influence on the server. That is why I want to create a method that still parses the string when it contains an invalid UTF-8 character. — Wubbe, May 21 '15 at 20:28
@Wubbe: You could still check each character, not with only one byte, but by checking the possible values of consecutive uint8 values where needed, according to the documentation (http://en.wikipedia.org/wiki/UTF-8#Description) and inspired by this: http://stackoverflow.com/questions/28890907/implement-a-function-to-check-if-a-string-byte-array-follows-utf-8-format — Larme, May 21 '15 at 20:33
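
The per-character check suggested in the comment above could be sketched roughly as follows. This is a minimal illustration in plain C (which can be compiled into an Objective-C project), not code from the thread: the function names `utf8_valid_sequence_len` and `utf8_filter_lossy` are hypothetical, and the sketch assumes that dropping one bad byte and resynchronizing at the next byte is an acceptable 'lossy' policy.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the length (1-4) of the well-formed UTF-8
 * sequence starting at p, or 0 if the bytes at p do not form one.
 * `remaining` is the number of readable bytes at p. */
static size_t utf8_valid_sequence_len(const uint8_t *p, size_t remaining)
{
    if (remaining == 0) return 0;

    uint8_t b0 = p[0];
    size_t need;      /* total bytes the lead byte announces */
    uint32_t cp;      /* decoded code point */
    uint32_t min;     /* smallest code point allowed for this length */

    if (b0 < 0x80) return 1;                                      /* ASCII */
    else if ((b0 & 0xE0) == 0xC0) { need = 2; cp = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { need = 3; cp = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { need = 4; cp = b0 & 0x07; min = 0x10000; }
    else return 0;    /* stray continuation byte, or 0xF8..0xFF */

    if (remaining < need) return 0;               /* truncated sequence */

    for (size_t i = 1; i < need; i++) {
        if ((p[i] & 0xC0) != 0x80) return 0;      /* not a continuation byte */
        cp = (cp << 6) | (uint32_t)(p[i] & 0x3F);
    }

    if (cp < min) return 0;                       /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* UTF-16 surrogate */
    if (cp > 0x10FFFF) return 0;                  /* beyond Unicode range */
    return need;
}

/* Hypothetical helper: copies only well-formed sequences from src to dst.
 * dst must have room for src_len bytes. Returns the number of bytes written. */
static size_t utf8_filter_lossy(const uint8_t *src, size_t src_len, uint8_t *dst)
{
    size_t src_pos = 0, dst_pos = 0;
    while (src_pos < src_len) {
        size_t n = utf8_valid_sequence_len(src + src_pos, src_len - src_pos);
        if (n == 0) {             /* drop a single bad byte, then resync */
            src_pos++;
            continue;
        }
        for (size_t k = 0; k < n; k++) dst[dst_pos++] = src[src_pos + k];
        src_pos += n;
    }
    return dst_pos;
}
```

The filtered bytes could then be passed to `-[NSString initWithBytes:length:encoding:]` with `NSUTF8StringEncoding`. Resynchronizing one byte at a time means a bad lead byte does not swallow valid characters that follow it.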

Wubbe · Accepted Answer · 2015-05-23T13:43:34.317

It works! By combining the code snippet from Larme and the comment about the size of UTF-8 characters I managed to create a 'lossy' NSData to UTF-8 NSString conversion method.

+ (NSString *) data2UTF8String:(NSData *) data {

    // First try to do the 'standard' UTF-8 conversion 
    NSString * bufferStr = [[NSString alloc] initWithData:data
                                                 encoding:NSUTF8StringEncoding];

    // if it fails, do the 'lossy' UTF8 conversion
    if (!bufferStr) {
        const Byte * buffer = [data bytes];
        NSUInteger dataLength = [data length];

        NSMutableString * filteredString = [[NSMutableString alloc] init];

        NSUInteger i = 0;
        while (i < dataLength) {

            // Derive the sequence length from the lead byte. The 5- and 6-byte
            // forms are not valid UTF-8 (RFC 3629); they are matched only so
            // that the whole sequence is skipped.
            NSUInteger expectedLength = 1;

            if      ((buffer[i] & 0b10000000) == 0b00000000) expectedLength = 1;
            else if ((buffer[i] & 0b11100000) == 0b11000000) expectedLength = 2;
            else if ((buffer[i] & 0b11110000) == 0b11100000) expectedLength = 3;
            else if ((buffer[i] & 0b11111000) == 0b11110000) expectedLength = 4;
            else if ((buffer[i] & 0b11111100) == 0b11111000) expectedLength = 5;
            else if ((buffer[i] & 0b11111110) == 0b11111100) expectedLength = 6;

            // Clamp so a truncated trailing sequence never reads past the buffer.
            NSUInteger length = MIN(expectedLength, dataLength - i);
            NSData * character = [NSData dataWithBytes:&buffer[i] length:length];

            // initWithData: honors the data length. stringWithUTF8String: expects
            // a NUL-terminated C string, which [character bytes] is not.
            NSString * possibleString = [[NSString alloc] initWithData:character
                                                              encoding:NSUTF8StringEncoding];
            if (possibleString) {
                [filteredString appendString:possibleString];
            }
            i = i + expectedLength;
        }
        bufferStr = filteredString;
    }

    return bufferStr;
}

If you have any comments, please let me know. Thanks Larme!

If the last "character" is split in two between the current stream event and the next one, you may get a buffer overrun when you try to read a character whose length extends past the end of the buffer. So check that i + expectedLength is within the range of the buffer before creating the `NSData character`. You may also want to keep the incomplete trailing bytes and place them at the start of the buffer the next time your delegate method is called. — Larme, May 23 '15 at 10:11
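
The carry-over idea from the comment above could look roughly like this. It is a sketch in plain C under stated assumptions, not code from the thread: `utf8_incomplete_tail_len` is a hypothetical name, and the function only inspects lead and continuation bit patterns. It does not validate the rest of the buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns how many bytes at the end of buf look like the
 * start of a multi-byte UTF-8 sequence that has not fully arrived yet (0-3).
 * Returns 0 when the buffer ends on a complete (or plainly invalid) sequence. */
static size_t utf8_incomplete_tail_len(const uint8_t *buf, size_t len)
{
    size_t back = 0;   /* number of trailing bytes examined so far */

    /* Walk backwards over at most 4 trailing bytes looking for a lead byte. */
    while (back < len && back < 4) {
        uint8_t b = buf[len - 1 - back];
        back++;

        if ((b & 0xC0) == 0x80) continue;      /* continuation byte: keep looking */

        size_t need;                           /* bytes announced by the lead byte */
        if (b < 0x80)                need = 1;
        else if ((b & 0xE0) == 0xC0) need = 2;
        else if ((b & 0xF0) == 0xE0) need = 3;
        else if ((b & 0xF8) == 0xF0) need = 4;
        else return 0;                         /* invalid lead byte: nothing to hold back */

        /* Hold the tail back only if fewer bytes arrived than were announced. */
        return (back < need) ? back : 0;
    }
    return 0;   /* empty buffer, or only continuation bytes at the end */
}
```

In the stream delegate, one could convert the first `len - tail` bytes and move the last `tail` bytes to the front of the buffer before the next `NSStreamEventHasBytesAvailable` event, so a character split across two reads is decoded once both halves are present.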

score 0 · Answer 2 · answered May 12 '22 at 11:27

I created an NSString category with a -[validUTF8String] method which, in case UTF8String returns NULL, strips invalid surrogate characters then calls UTF8String on that cleaned string:

@interface NSString (ValidUTF8String)

- (const char *)validUTF8String;
- (NSString *)stringByStrippingInvalidUnicode;  // warning: very inefficient! should only be called when we are sure that the string contains invalid Unicode, e.g. when -[UTF8String] is NULL

@end

@implementation NSString (ValidUTF8String)

- (const char *)validUTF8String;
{
    const char *result=[self UTF8String];
    if (!result)
    {
        result=[[self stringByStrippingInvalidUnicode] UTF8String];
        if (!result)
            result="";
    }
    return result;
}

#define isHighSurrogate(k)  ((k>=0xD800) && (k<=0xDBFF))
#define isLowSurrogate(k)   ((k>=0xDC00) && (k<=0xDFFF))

- (NSString *)stringByStrippingInvalidUnicode
{
    NSMutableString *fixed=[[self mutableCopy] autorelease];
    for (NSInteger idx=0; idx<[fixed length]; idx++)
    {
        unichar k=[fixed characterAtIndex:idx];
        if (isHighSurrogate(k))
        {
            BOOL nextIsLowSurrogate=NO;
            if (idx+1<[fixed length])
            {
                unichar nextK=[fixed characterAtIndex:idx+1];
                nextIsLowSurrogate=isLowSurrogate(nextK);
            }
            if (!nextIsLowSurrogate)
            {
                [fixed deleteCharactersInRange:NSMakeRange(idx, 1)];
                idx--;
            }
        }
        else if (isLowSurrogate(k))
        {
            BOOL previousWasHighSurrogate=NO;
            if (idx>0)
            {
                unichar previousK=[fixed characterAtIndex:idx-1];
                previousWasHighSurrogate=isHighSurrogate(previousK);
            }
            if (!previousWasHighSurrogate)
            {
                [fixed deleteCharactersInRange:NSMakeRange(idx, 1)];
                idx--;
            }
        }
    }
    return fixed;
}

@end
