How to display Persian script through Unicode

Question

Can someone please help me display this string in Persian script: "\u0622\u062f\u0631\u0633 \u0627\u06cc\u0645\u06cc\u0644"

I have tried using

NSData *data = [yourtext dataUsingEncoding:NSUTF8StringEncoding];
NSString *decodevalue = [[NSString alloc] initWithData:data encoding:NSNonLossyASCIIStringEncoding];

and this gets returned: u0622u062fu0631u0633 u0627u06ccu0645u06ccu0644

I want the same solution for Objective-C: https://www.codeproject.com/Questions/714169/Conversion-from-Unicode-to-Original-format-csharp

Questions seeking debugging help ("why isn't this code working?") must include the desired behavior, a specific problem or error and the shortest code necessary to reproduce it in the question itself. Questions without a clear problem statement are not useful to other readers. See: [How to create a Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve). — Willeke, May 01 '18 at 14:06
what's inside "yourtext" ? where you get it from? if it is already in UTF-8, why can't you just display it? what gets printed with `NSLog(@"%@", yourtext)`? — battlmonstr, May 01 '18 at 21:57
@battlmonstr this is the log that is printed: u0622u062fu0631u0633 u0627u06ccu0645u06ccu0644 I am still stuck on it. Please help. — Ghazalah, May 02 '18 at 10:00
How do you want to display the string, in a control or in a custom view? Which OS? — Willeke, May 02 '18 at 11:25
`po @"\u0622\u062f\u0631\u0633 \u0627\u06cc\u0645\u06cc\u0644"` logs "آدرس ایمیل". I don't know if this is Persian, but Google translates it as "e-mail". `NSTextField` displays "آدرس ایمیل". — Willeke, May 02 '18 at 11:34
@Willeke I wanted to display it in a UI element like a button or a label. The answer given by BattleMonstr below worked for me just fine. — Ghazalah, May 02 '18 at 13:40
Was the question how to convert `"\\u0622\\u062f"` to `"\u0622\u062f"`? — Willeke, May 02 '18 at 14:03
No, I just wanted the text to be displayed in my UI elements in the persian script. — Ghazalah, May 02 '18 at 19:16
@Willeke, yes you're right. @Ghazalah `\u0622\u062f` gets translated to Persian script by compiler. — battlmonstr, May 02 '18 at 22:23

score 1 · Accepted Answer · answered May 02 '18 at 12:31

I assume that your input string contains backslash-escaped codes (as if it had been copied verbatim from a source code file), and that you want to parse the escape sequences into a Unicode string while preserving the unescaped characters as they are.

This is what I've come up with:

NSError *badRegexError = nil;
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"(\\\\u([a-f0-9]{4})|.)" options:0 error:&badRegexError];
// Check the return value: the error is only meaningful when nil is returned.
if (!regex) {
    NSLog(@"bad regex: %@", badRegexError);
    return;
}

NSString *input = @"\\u0622\\u062f\\u0631\\u0633 123 test -_- \\u0627\\u06cc\\u0645\\u06cc\\u0644";
NSMutableString *output = [NSMutableString new];
[regex enumerateMatchesInString:input options:0 range:NSMakeRange(0, input.length)
                     usingBlock:^(NSTextCheckingResult *result, NSMatchingFlags flags, BOOL *stop)
{
    NSRange codeRange = [result rangeAtIndex:2];
    if (codeRange.location != NSNotFound) {
        NSString *codeStr = [input substringWithRange:codeRange];
        NSScanner *scanner = [NSScanner scannerWithString:codeStr];
        unsigned int code;
        if ([scanner scanHexInt:&code]) {
            unichar c = (unichar)code;
            [output appendString:[NSString stringWithCharacters:&c length:1]];
        }
    } else {
        [output appendString:[input substringWithRange:result.range]];
    }
}];

NSLog(@"  actual: %@", output);
NSLog(@"expected: %@", @"\u0622\u062f\u0631\u0633 123 test -_- \u0627\u06cc\u0645\u06cc\u0644");

Explanation

This is using a regex that finds blocks of 6 characters like \uXXXX, for example \u062f. It extracts the code as a string like 062f, and then uses NSScanner.scanHexInt to convert it to a number. It assumes that this number is a valid unichar, and builds a string from it.

Note the \\\\ in the regex: first the Objective-C compiler removes one layer of slashes, so it becomes \\, and then the regex compiler removes the second layer, so it becomes \, which is matched literally. If you have just "u0622u062f..." (without slashes), try removing \\\\ from the regex.
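To make the two layers of escaping concrete, here is a minimal sketch. The function name UnicodeEscapePattern is made up for illustration; it only returns the escape part of the pattern so you can inspect what the regex engine actually receives.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of the answer's code): returns the pattern
// for one \uXXXX escape, written as an Objective-C string literal.
static NSString *UnicodeEscapePattern(void) {
    // Four backslashes in source become two characters in the NSString.
    // The regex engine then reads those two as one literal backslash.
    return @"\\\\u([a-f0-9]{4})";
}

// Logs the pattern as the regex engine sees it, to verify the escaping.
static void LogUnicodeEscapePattern(void) {
    NSLog(@"%@", UnicodeEscapePattern()); // logs: \\u([a-f0-9]{4})
}
```

Logging the pattern like this is a quick way to check how many backslashes survived the compiler before you start debugging the regex itself.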

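The second part of the regex (|.) matches any other single character, which is copied to the output as is. Note that . does not match line breaks by default, so newlines in the input are skipped unless you pass NSRegularExpressionDotMatchesLineSeparators.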

Caveats

You might also want to make the matching case insensitive by passing NSRegularExpressionCaseInsensitive in the regex options; as written, the pattern only accepts lowercase hex digits.
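As a sketch of that option, the helper below (CountUnicodeEscapes is a made-up name) builds the same escape pattern with NSRegularExpressionCaseInsensitive and counts the matches. Be aware that the flag applies to the whole pattern, so an uppercase \U prefix would match as well.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: counts \uXXXX escapes in the input, accepting both
// uppercase and lowercase hex digits.
static NSUInteger CountUnicodeEscapes(NSString *input) {
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"\\\\u([a-f0-9]{4})"
                                                  options:NSRegularExpressionCaseInsensitive
                                                    error:&error];
    if (!regex) {
        NSLog(@"bad regex: %@", error);
        return 0;
    }
    return [regex numberOfMatchesInString:input
                                  options:0
                                    range:NSMakeRange(0, input.length)];
}
```

With options:0 instead, an input such as \u062F (uppercase F) would not be recognized as an escape and would be copied through character by character.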

This doesn't handle invalid character codes: every four-digit value is appended as a unichar without any validation.
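One possible reading of "invalid" is a surrogate half (0xD800 to 0xDFFF), which is not a character on its own. A minimal check, assuming you would rather skip such values than append them, could look like this; IsStandaloneCodeUnit is a hypothetical name, not something from Foundation.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns YES if the scanned value can stand alone as
// a single UTF-16 code unit representing a character.
static BOOL IsStandaloneCodeUnit(unsigned int code) {
    if (code > 0xFFFF) {
        return NO; // does not fit in a unichar
    }
    if (code >= 0xD800 && code <= 0xDFFF) {
        return NO; // surrogate half, only meaningful as part of a pair
    }
    return YES;
}
```

You could call it right after scanHexInt: succeeds and before the cast to unichar. The trade-off is that this also rejects valid surrogate pairs written as two consecutive escapes, so inputs containing characters outside the Basic Multilingual Plane would need pair-aware handling instead.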

This is not the most performant solution: it creates a temporary string for every matched character. If you need to do this at scale, use a proper parsing library.
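One alternative that avoids hand-written parsing is CFStringTransform with the ICU transform ID "Any-Hex/Java" applied in reverse. This is a different technique from the regex above, shown as a sketch under the assumption that your input uses Java-style \uXXXX escapes; DecodeUnicodeEscapes is a made-up wrapper name.

```objc
#import <Foundation/Foundation.h>

// Hypothetical wrapper: decodes \uXXXX escapes with an ICU transform and
// leaves all other characters untouched. Returns nil if the transform fails.
static NSString *DecodeUnicodeEscapes(NSString *input) {
    NSMutableString *decoded = [input mutableCopy];
    // "Any-Hex/Java" turns characters into \uXXXX escapes; passing true for
    // the reverse argument applies it in the opposite direction.
    Boolean ok = CFStringTransform((__bridge CFMutableStringRef)decoded,
                                   NULL,
                                   CFSTR("Any-Hex/Java"),
                                   true);
    return ok ? [decoded copy] : nil;
}
```

The result can be assigned directly to a label or button title, for example label.text = DecodeUnicodeEscapes(serverString), where serverString stands for whatever escaped text you received.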
