I am working with an Objective-C application. Specifically, I am gathering the dictionary representation of NSUserDefaults with this code:

NSUserDefaults *defaults = [NSUserDefaults standardUserDefaults];

NSDictionary *userDefaultsDict = [defaults dictionaryRepresentation];

While enumerating keys and objects of the resulting dict, sometimes I find a kind of opaque string that you can see in the following picture:

enter image description here

So it seems like an encoding problem.

If I try to print description of the string, the debugger correctly prints:

Printing description of obj:
tsuqsx

However, if I try to write obj to a file, or use it in any other way, I get an unreadable output like this:

enter image description here

What I would like to achieve is the following:

  1. Detect in some way that the string has the encoding problem.

  2. Convert the string to UTF8 encoding to use it in the rest of the program.

Any help is greatly appreciated. Thanks

EDIT: A very hacky possible solution that helps explain what I am trying to do.

After trying every approach I could think of based on dataUsingEncoding: and converting back, I ended up with the following solution. It is absolutely weird, but I am posting it here in the hope that it helps somebody guess the encoding and what to do with the unprintable characters:

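- (BOOL)isProblematicString:(NSString *)candidateString {

    BOOL returnValue = YES;

    if ([candidateString length] <= 2) {
        return NO;
    }

    const char *temp = [candidateString UTF8String];

    // The first byte is treated as a length. Clamp it to the last valid
    // index so the loop below never reads past the end of temp.
    long length = temp[0];
    long lastIndex = (long)strlen(temp) - 1;
    if (length > lastIndex) {
        length = lastIndex;
    }
    if (length < 0) {
        length = 0;
    }

    char *dest = malloc(length + 1);

    long ctr = 1;

    long usefulCounter = 0;
    for (ctr = 1; ctr <= length; ctr++) {

        if ((ctr - 1) % 3 == 0) {
            // Every third byte (starting at index 1) is a readable character.
            memcpy(&dest[ctr - usefulCounter - 1], &temp[ctr], 1);
        } else {
            // The bytes in between are expected to be in the range 0x10-0x1F.
            if (ctr != 1 && ctr < [candidateString length]) {
                if (temp[ctr] < 0x10 || temp[ctr] > 0x1F) {
                    returnValue = NO;
                }
            }
            usefulCounter += 1;
        }

    }
    memset(&dest[length], 0, 1);
    free(dest);

    return returnValue;
}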

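- (NSString *)utf8StringFromUnknownEncodedString:(NSString *)originalUnknownString {

    const char *temp = [originalUnknownString UTF8String];

    // The first byte is treated as a length. Clamp it to the last valid
    // index so the loop below never reads past the end of temp.
    long length = temp[0];
    long lastIndex = (long)strlen(temp) - 1;
    if (length > lastIndex) {
        length = lastIndex;
    }
    if (length < 0) {
        length = 0;
    }

    // calloc zero-fills the buffer, so the compacted bytes are always
    // NUL-terminated, even though fewer than length bytes are written.
    char *dest = calloc(length + 1, 1);

    long ctr = 1;

    long usefulCounter = 0;
    for (ctr = 1; ctr <= length; ctr++) {

        if ((ctr - 1) % 3 == 0) {
            // Keep every third byte (starting at index 1) and skip the rest.
            memcpy(&dest[ctr - usefulCounter - 1], &temp[ctr], 1);
        } else {
            usefulCounter += 1;
        }

    }

    NSString *returnValue = [[NSString alloc] initWithUTF8String:dest];
    free(dest);

    return returnValue;
}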

This gives me a string that I can use to build a full UTF8 string, but I am looking for a clean solution. Any help is greatly appreciated. Thanks

Alfonso Tesauro
  • It's a perfectly valid string and has no encoding problems. `\x1a` is a non-printable character (substitute, ^Z). It includes other non-printable characters as well. This string is a valid `AppleMapID` and it comes from the `/Library/Preferences/.GlobalPreferences.plist`. The question here is - what are you trying to do? Is this some kind of accidental discovery and you're curious? Or are you going to work (modify, read) with this value? – zrzka Aug 12 '20 at 09:48
  • Thanks a lot for your interest. I am trying to obtain a string representation of the dictionary entirely in UTF8. The last picture shows the result of my obtained string, saved to disk, and reopened in BBEdit. I need to get a string representation in entirely valid UTF8. Thanks again – Alfonso Tesauro Aug 12 '20 at 12:32

1 Answer

We're talking about a string which comes from the /Library/Preferences/.GlobalPreferences.plist (key com.apple.preferences.timezone.new.selected_city).

NSString *city = [[NSUserDefaults standardUserDefaults]
                  stringForKey:@"com.apple.preferences.timezone.new.selected_city"];
NSLog(@"%@", city); // \^Zt\^\\^]s\^]\^\u\^V\^_q\^]\^[s\^W\^Zx\^P
(lldb) p [city description]
(__NSCFString *) $1 = 0x0000600003f6c240 @"\x1at\x1c\x1ds\x1d\x1cu\x16\x1fq\x1d\x1bs\x17\x1ax\x10"

You wrote:

What I would like to achieve is the following:

  1. Detect in some way that the string has the encoding problem.
  2. Convert the string to UTF8 encoding to use it in the rest of the program.

and:

After trying all possible solutions based on dataUsingEncoding and back.

This string has no encoding problem; characters like \x1a and \x1c are valid characters. You can call dataUsingEncoding: with ASCII, UTF-8, and so on, but all of these characters will still be present. They're called control characters (or non-printing characters). The Wikipedia page on control characters explains what they are and how they're defined in ASCII, extended ASCII and Unicode.
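To make this concrete, here is a minimal sketch in plain C, which also compiles unchanged inside an Objective-C source file. The helper names `is_seven_bit_ascii` and `count_control_bytes` are made up for this illustration. The check is byte-based and assumes the default "C" locale, so it only sees the ASCII control range (0x00-0x1F and 0x7F), not Unicode C1 controls or format characters.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if every byte is below 0x80, i.e. the string is
   plain ASCII and therefore also valid UTF-8. */
static bool is_seven_bit_ascii(const char *s) {
    for (; *s != '\0'; s++) {
        if ((unsigned char)*s >= 0x80) {
            return false;
        }
    }
    return true;
}

/* Hypothetical helper: counts ASCII control bytes (0x00-0x1F and 0x7F). */
static size_t count_control_bytes(const char *s) {
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (iscntrl((unsigned char)*s)) {
            count++;
        }
    }
    return count;
}
```

For the string shown above, `is_seven_bit_ascii` returns true and `count_control_bytes` returns 12 (18 bytes in total, six of them letters). A nonzero count is therefore a simple way to detect strings of this kind, even though they are already valid UTF-8.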

What you're looking for is a way to remove control characters from a string.

Remove control characters

We can create a category for our new method:

@interface NSString (ControlCharacters)

- (NSString *)stringByRemovingControlCharacters;

@end

@implementation NSString (ControlCharacters)

- (NSString *)stringByRemovingControlCharacters {
    // TODO Remove control characters
    return self;
}

@end

In all examples below, the city variable is created in this way ...

NSString *city = [[NSUserDefaults standardUserDefaults]
                  stringForKey:@"com.apple.preferences.timezone.new.selected_city"];

... and contains @"\x1at\x1c\x1ds\x1d\x1cu\x16\x1fq\x1d\x1bs\x17\x1ax\x10". Also all examples below were tested with the following code:

NSString *cityWithoutCC = [city stringByRemovingControlCharacters];
// tsuqsx
NSLog(@"%@", cityWithoutCC);
// {length = 6, bytes = 0x747375717378}
NSLog(@"%@", [cityWithoutCC dataUsingEncoding:NSUTF8StringEncoding]);

Split & join

One way is to utilize the NSCharacterSet.controlCharacterSet. There's a stringByTrimmingCharactersInSet: method (NSString), but it removes these characters from the beginning/end only, which is not what you're looking for. There's a trick you can use:

- (NSString *)stringByRemovingControlCharacters {
    NSArray<NSString *> *components = [self componentsSeparatedByCharactersInSet:NSCharacterSet.controlCharacterSet];
    return [components componentsJoinedByString:@""];
}

It splits the string by control characters and then joins these components back. Not a very efficient way, but it works.

ICU transform

Another way is to use an ICU transform (see the ICU User Guide). There's a stringByApplyingTransform:reverse: method (NSString), but it only accepts predefined constants. The documentation says:

The constants defined by the NSStringTransform type offer a subset of the functionality provided by the underlying ICU transform functionality. To apply an ICU transform defined in the ICU User Guide that doesn't have a corresponding NSStringTransform constant, create an instance of NSMutableString and call the applyTransform:reverse:range:updatedRange: method instead.

Let's update our implementation:

- (NSString *)stringByRemovingControlCharacters {
    NSMutableString *result = [self mutableCopy];
    [result applyTransform:@"[[:Cc:] [:Cf:]] Remove"
                   reverse:NO
                     range:NSMakeRange(0, self.length)
              updatedRange:nil];
    return result;
}

[:Cc:] represents control characters and [:Cf:] represents format characters. Together they cover the same characters as the already mentioned NSCharacterSet.controlCharacterSet, whose documentation says:

A character set containing the characters in Unicode General Category Cc and Cf.
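For reference, General Category Cc is small enough to write out by hand: it covers U+0000-U+001F and U+007F-U+009F. The C sketch below encodes exactly that range; the helper name `is_general_category_cc` is hypothetical. Category Cf has no such compact range, which is one reason to rely on NSCharacterSet or ICU rather than a hand-written test.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if the code point is in General Category Cc,
   i.e. the C0 controls, DEL, and the C1 controls. */
static bool is_general_category_cc(uint32_t code_point) {
    return code_point <= 0x1F || (code_point >= 0x7F && code_point <= 0x9F);
}
```

Every non-letter in the string from the question (\x1a, \x1c, \x1d, ...) falls in the first part of that range.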

Iterate over characters

NSCharacterSet also offers the characterIsMember: method. Here we need to iterate over the characters (unichar) and check whether each one is a control character.

Let's update our implementation:

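- (NSString *)stringByRemovingControlCharacters {
    NSUInteger length = self.length;
    if (length == 0) {
        return self;
    }

    // Heap buffer instead of a stack array, so long strings cannot overflow the stack.
    unichar *characters = malloc(length * sizeof(unichar));
    if (characters == NULL) {
        return nil; // allocation failed
    }
    [self getCharacters:characters range:NSMakeRange(0, length)];

    NSCharacterSet *controlCharacterSet = NSCharacterSet.controlCharacterSet;

    // Compact in place: the write index never overtakes the read index.
    NSUInteger resultLength = 0;
    for (NSUInteger i = 0; i < length; i++) {
        if (![controlCharacterSet characterIsMember:characters[i]]) {
            characters[resultLength++] = characters[i];
        }
    }

    NSString *result = [NSString stringWithCharacters:characters length:resultLength];
    free(characters);
    return result;
}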

Here we filter out all characters (unichar) which belong to the controlCharacterSet.

Other ways

There are other ways to iterate over characters; see, for example, "Most efficient way to iterate over all the chars in an NSString".

BBEdit & others

Let's write this string to a file:

NSString *city = [[NSUserDefaults standardUserDefaults]
                  stringForKey:@"com.apple.preferences.timezone.new.selected_city"];

[city writeToFile:@"/Users/zrzka/city.txt"
       atomically:YES
         encoding:NSUTF8StringEncoding
            error:nil];

It's up to the editor how these control characters are handled and displayed. Here's an example from Visual Studio Code.

View - Render Control Characters off:

enter image description here

View - Render Control Characters on:

enter image description here

BBEdit displays upside-down question marks, but I'm sure there's a way to toggle control character rendering. I don't have BBEdit installed to verify it.

zrzka
  • Outstanding Answer zika ! all the methods work. The only left thing to do is to determine if a string needs such a conversion. Do you think I can call your code on all the strings and it will not change the strings that do not contain the control characters ? Thanks – Alfonso Tesauro Aug 13 '20 at 11:26
  • You can call it on all strings. But be aware that the control character set contains characters like `\n`, `\r`, `\t`, ... Do you want to keep them? I don't know. If the answer is yes, create a [`NSMutableCharacterSet`](https://developer.apple.com/documentation/foundation/nsmutablecharacterset/1414334-controlcharacterset?language=objc) & remove them via the [`removeCharactersInString:`](https://developer.apple.com/documentation/foundation/nsmutablecharacterset/1414812-removecharactersinstring?language=objc) method. Like - `[controlCharacterSet removeCharactersInString:@"\n\r\t"]`. – zrzka Aug 13 '20 at 11:55
  • Hello zrzka, I was thinking, is there a way to create a string literal from that problematic string ? Thanks a lot. – Alfonso Tesauro Aug 14 '20 at 20:49
  • It's in the answer, your question, ... `@"\x1at\x1c\x1ds\x1d\x1cu\x16\x1fq\x1d\x1bs\x17\x1ax\x10"`. – zrzka Aug 15 '20 at 11:10
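Following up on the comment about keeping `\n`, `\r` and `\t`: the same idea can be sketched at the byte level in plain C, which also compiles inside an Objective-C file. The function name `strip_control_bytes_keep_whitespace` is hypothetical, and the sketch only handles ASCII control bytes in the default "C" locale. For full Unicode coverage, the NSMutableCharacterSet approach from the comment is the better fit.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies src into dst, dropping ASCII control bytes but
   keeping newline, carriage return and tab. dst must have room for
   strlen(src) + 1 bytes. Returns the length of the result. */
static size_t strip_control_bytes_keep_whitespace(const char *src, char *dst) {
    size_t out = 0;
    for (; *src != '\0'; src++) {
        unsigned char byte = (unsigned char)*src;
        if (iscntrl(byte) && strchr("\n\r\t", byte) == NULL) {
            continue; /* control byte that is not on the keep list */
        }
        dst[out++] = (char)byte;
    }
    dst[out] = '\0';
    return out;
}
```

One caveat when writing test literals: a `\x` escape in C and Objective-C consumes every hex digit that follows it, so a control byte followed by a letter from a-f needs a split literal such as `"\x1b" "c"`. The literal from the question happens to be safe because t, s, u, q and x are not hex digits.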