I am coming from this post, "Swift 4 JSON String with unknown UTF8 "�" character is not convertible to Data/ Dictionary", but in the meantime I was able to isolate the issue to a 10-character string.

Short intro: one user's app did not show any content. Looking at his 6 KB of data in plain text with TextWrangler, I found 2 red question marks.

I tried to cut some chunks of the base64-encoded data around the question marks and convert them to Data and then to a String, which didn't work. As soon as I removed the bits belonging to the red question mark from the chunks, it seemed to work again. Please take a look at the following Playground example:

//those do NOT work
let toEndBracket = "ACAAKgBVAFMAQQAqACAnlgAg2DwAIgB9AF0A" // *USA* ' <"}]//
let toMidBracket = "ACAAKgBVAFMAQQAqACAnlgAg2DwAIgB9"     // *USA* ' <"}//
let toCarrot =     "ACAAKgBVAFMAQQAqACAnlgAg2DwA"         // *USA* ' <//
let toSpace =      "ACAAKgBVAFMAQQAqACAnlgAg"             // *USA* ' //

//but this one WORKS
let toApostrophe = "ACAAKgBVAFMAQQAqACAn"                 // *USA* '//
//(basically the last one is without the space before the carrot, I've added the slashes after it to emphasize that)
//clear strings taken from https://www.base64decode.org/ using the UTF-8 setting WITHOUT "Live mode".

if let textData = Data(base64Encoded: toApostrophe) {
    print("Data created")   //works for all of them
    print(textData)
    if let decodedString = String(data: textData, encoding: .utf8) {
        print("WORKED!!!")  //only happens for the toApostrophe
        print(decodedString)
    } else {
        print("DID NOT WORK")
    }
}

So it basically fails as soon as it contains lgAg. Replacing this with something like U29t does make the small strings work again, but I can't do this in production code as I am sure my examples aren't the only occurrences of this issue. I don't care what happens to the original characters/symbols/emojis that are causing this; if there was a way to just "ignore" them, that would be more than helpful already!

Here is another example of where this occurs:

//OTHER SYMBOL WITH SAME BEHAVIOR
//not working
let secondFromSpace =  "ACDYPAAiACwA"       // <",//

//WORKING
let secondFromCarrot = "PAAiACwA"           //<",//

Here is the original text in its habitat, a messenger message saying "USA" with an emoji, hence the "USA" in my example texts and my suspicion that it's the emojis that make it break:

enter image description here

I'd be grateful if someone can tell me how I can "clean up" the base64 string so it's convertible to data again. It might also be due to some weird encoding with some of the emojis, but in the vast majority of cases the app receives and displays content with emojis just fine.


I have finally figured out why this is happening. It's not a Swift-side solution to my problem, but now it at least makes some sense. For previews of new content I cut off strings to match the viewport of the browser. This particular unlucky user had the USA flag emoji right on the edge of the display bezel. I would never have thought of emojis consisting of multiple code units and JavaScript's substring() decapitating them. Take a look at the picture, which explains where the character comes from.
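The effect described above can be reproduced on the Swift side. This is only a sketch under assumptions: the `message` text is a made-up stand-in for the real preview, and the cut position of 5 is chosen to land inside the flag. It mimics a substring that counts UTF-16 code units, which is how JavaScript's substring() indexes strings, and contrasts it with cutting by Swift `Character`.

```swift
// Hypothetical preview text: "USA " followed by the US flag (two regional indicators).
let message = "USA \u{1F1FA}\u{1F1F8}"

// Swift counts user-perceived characters; the UTF-16 view counts code units.
let characterCount = message.count      // the flag is a single Character
let utf16Count = message.utf16.count    // the flag alone takes four code units

// Cutting by UTF-16 offset, as a code-unit-based substring would, can end on half a pair.
let cutUnits = Array(message.utf16.prefix(5))
let endsOnLoneSurrogate = cutUnits.last.map { UTF16.isLeadSurrogate($0) } ?? false

// Cutting by Character keeps an emoji whole or drops it entirely.
let safePreview = String(message.prefix(4))

print(characterCount, utf16Count, endsOnLoneSurrogate, safePreview)
```

The last code unit of `cutUnits` is 0xD83C, the same lone high surrogate that shows up in the broken base64 data.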

I would still appreciate an answer as to how to avoid/ignore/catch that in Swift but to every poor soul running into this issue I hope you will stumble across this thread.

enter image description here

user2875404
  • Base64 decoding the full string gives `<0020002a 00550053 0041002a 00202796 0020d83c 0022007d 005d00>` – that is *almost* valid UTF-16 (but surely not UTF-8). Almost means: `d83c` is a high-surrogate and would require a following low-surrogate. – Martin R Sep 26 '18 at 19:20
  • Hello, thank you for your response! I know virtually nothing about encodings, can you point me to a direction on how to solve this issue? Again, data consistency isn’t of importance at all so if you know an easy but dirty solution on top of your head that would already be perfect – user2875404 Sep 26 '18 at 19:29
  • try encoding ` *USA* ' <"}]` in https://www.base64encode.org/ , What does it give you? and in what destination Character set? – ielyamani Sep 26 '18 at 19:36
  • Could you show the way you are encoding "USA " – ielyamani Sep 26 '18 at 19:40
  • If you don't show how you come up with your input string nobody will be able to help. https://stackoverflow.com/a/43817935/2303865 – Leo Dabus Sep 26 '18 at 19:43
  • @Carpsen90 that string should result in `VVNBIPCfh7rwn4e4` – Leo Dabus Sep 26 '18 at 19:45
  • @LeoDabus Yes, I am getting the same result – ielyamani Sep 26 '18 at 19:48
  • Can you add the original String to your question? – Leo Dabus Sep 26 '18 at 20:02
  • I get the String along with other data `JSON.stringify`ed from a non-public website via JavaScript. I add the string to a dictionary in Swift and then do `let data = NSKeyedArchiver.archivedData(withRootObject: dict)`. However I have just noticed that if I `JSON.stringify` the USA string on its own it contains the emoji. But my production-code js puts it into a dictionary along with other values and when this happens it ignores the emoji. Weirdly it's just this single emoji which is no different from the other ones. I will now try to find the reason for the js ignoring the emoji and if it(1/2) – user2875404 Sep 26 '18 at 20:08
  • affects my swift code. I didn't have this issue with a similar set of data (also non-working emoji) so chances are good it will work either way now but I am still concerned about just being able to ignore the incomplete character tho (2/2) – user2875404 Sep 26 '18 at 20:09
  • @LeoDabus yeah your statement made me go through the JS again and then I realized how I was able to get the malformed JSON. It's using `substring` on an emoji which ofc consists of multiple letters. So now we know how to reproduce this issue, thanks a lot man. Now if someone happens to know of a way to make Swift not break as soon as it sees an emoji in thirds that would make today's 8-hour-journey on this bug complete – user2875404 Sep 26 '18 at 20:28

1 Answer

(Some of this is out of comments, but trying to bring it together and describe solutions.)

First, your strings are not UTF-8. They're UTF-16 or malformed UTF-16. Sometimes UTF-16 happens to be interpretable as UTF-8, but when it is, there will be NULL characters scattered through the string. In your "working" example, it's not really working.

let toApostrophe = "ACAAKgBVAFMAQQAqACAn"                 // *USA* '//
if let textData = Data(base64Encoded: toApostrophe) {
    if let decodedString = String(data: textData, encoding: .utf8) {
        print(decodedString)
        print(decodedString.count)
        print(decodedString.map { $0.unicodeScalars.map { $0.value } } )
    } else {
        print("DID NOT DECODE UTF8")
    }
} else {
    print("DID NOT DECODE BASE64")
}

Prints:

 *USA* '
15
[[0], [32], [0], [42], [0], [85], [0], [83], [0], [65], [0], [42], [0], [32], [39]]

Note that the length of the string is 15 characters, not 8 like you were probably expecting. That's because it includes an extra invisible NULL (0) between most characters.

toEndBracket doesn't happen to be legal UTF-8, however. Here are its bytes:

["00", "20", "00", "2a", "00", "55", "00", "53", "00", "41", "00", "2a", "00", "20", "27", "96", "00", "20", "d8", "3c", "00", "22", "00", "7d", "00", "5d", "00"]
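A byte list like the one above can be produced with a few lines. This is a minimal sketch, not necessarily how the dump above was generated; the `hex` name and the manual zero-padding are my own choices.

```swift
import Foundation

let toEndBracket = "ACAAKgBVAFMAQQAqACAnlgAg2DwAIgB9AF0A"

// Fall back to empty data if the input is not valid base64.
let bytes = [UInt8](Data(base64Encoded: toEndBracket) ?? Data())

// Render every byte as two lowercase hex digits.
let hex = bytes.map { byte -> String in
    let digits = String(byte, radix: 16)
    return digits.count == 1 ? "0" + digits : digits
}

print(hex)
```

The odd byte count (27) is already a hint that this is not clean UTF-16, which always comes in two-byte code units.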

This is ok until it gets to 0x96. That starts with the bits 10, which marks a continuation byte, but the byte before it (0x27) is plain ASCII, so there is no multi-byte sequence for it to continue. That is why even toSpace fails. Further on, 0xd8 is broken too: it starts with the bits 110, which indicates that it's the start of a two-byte sequence, but the next byte is 0x3c, which is not a valid second byte of a multi-byte sequence (it should start with 10, but it starts with 00). So we can't decode this as UTF-8. Even using decodeCString(_:as:repairingInvalidCodeUnits:) can't decode this string because it's filled with embedded NULLs. You've got to decode it using at least the right encoding.
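To see where UTF-8 decoding trips, the lead/continuation rules can be checked by hand. The function below is a simplified sketch with a hypothetical name (`firstInvalidUTF8Offset`); it only checks the bit patterns discussed here and ignores overlong encodings and other finer rules of the standard. Run on these bytes, it reports offset 15, the stray continuation byte 0x96, which comes before the 0xd8 lead byte.

```swift
/// Returns the offset of the first byte that breaks the UTF-8 lead/continuation
/// structure, or nil if the structure is intact. Simplified: not a full validator.
func firstInvalidUTF8Offset(in bytes: [UInt8]) -> Int? {
    var index = 0
    while index < bytes.count {
        let continuationCount: Int
        switch bytes[index] {
        case 0x00...0x7F: continuationCount = 0   // 0xxxxxxx: single byte
        case 0xC0...0xDF: continuationCount = 1   // 110xxxxx: two-byte lead
        case 0xE0...0xEF: continuationCount = 2   // 1110xxxx: three-byte lead
        case 0xF0...0xF7: continuationCount = 3   // 11110xxx: four-byte lead
        default: return index                     // 10xxxxxx without a lead, or invalid
        }
        // Every expected continuation byte must exist and start with the bits 10.
        for offset in 0..<continuationCount {
            let next = index + 1 + offset
            guard next < bytes.count, bytes[next] & 0xC0 == 0x80 else { return index }
        }
        index += 1 + continuationCount
    }
    return nil
}

// Bytes of toApostrophe, then the remaining bytes of toEndBracket.
let working: [UInt8] = [0x00, 0x20, 0x00, 0x2a, 0x00, 0x55, 0x00, 0x53,
                        0x00, 0x41, 0x00, 0x2a, 0x00, 0x20, 0x27]
let broken = working + [0x96, 0x00, 0x20, 0xd8, 0x3c, 0x00, 0x22, 0x00, 0x7d, 0x00, 0x5d, 0x00]

print(firstInvalidUTF8Offset(in: working) as Any)
print(firstInvalidUTF8Offset(in: broken) as Any)
```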

But let's do that. Decode as UTF-16. At least that's close, even though it's slightly invalid UTF-16.

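let toEndBracketData = Data(base64Encoded: toEndBracket)!   // toEndBracket as defined in the question
let toEndBracket16 = String(data: toEndBracketData, encoding: .utf16)!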
// " *USA* ➖ �"}]"

Now we can at least work with this. It's invalid JSON, though, and it still contains the U+FFFD replacement character where the lone surrogate was. So we can strip that character by filtering it:

let legalJSON = String(toEndBracket16.filter { $0 != "\u{FFFD}" })
// " *USA* ➖ "}]"

I don't really recommend this approach. It's incredibly fragile and based on broken input. Fix the input. But in a world where you're trying to parse broken input, these are the tools.
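If you do go down this road, here is one way to bundle the steps into a single function that depends on Foundation only for the base64 step. It is a sketch under assumptions, not a drop-in fix: the name `lenientUTF16String(fromBase64:)` is made up, the payload is assumed to be big-endian UTF-16 without a byte order mark (which is what these samples look like), and a dangling odd byte at the end is silently dropped. It relies on `String(decoding:as:)` replacing ill-formed code units with U+FFFD.

```swift
import Foundation

/// Hypothetical helper: decodes base64 that wraps big-endian UTF-16 and drops
/// whatever could not be decoded (lone surrogates are repaired to U+FFFD first).
func lenientUTF16String(fromBase64 base64: String) -> String? {
    guard let data = Data(base64Encoded: base64) else { return nil }
    let bytes = [UInt8](data)

    // Pair bytes into big-endian code units; a dangling final byte is ignored.
    var codeUnits: [UInt16] = []
    var index = 0
    while index + 1 < bytes.count {
        codeUnits.append(UInt16(bytes[index]) << 8 | UInt16(bytes[index + 1]))
        index += 2
    }

    // String(decoding:as:) never fails; ill-formed input becomes U+FFFD.
    let repaired = String(decoding: codeUnits, as: UTF16.self)

    // Filter on scalars so that every replacement character is removed.
    var cleaned = String.UnicodeScalarView()
    for scalar in repaired.unicodeScalars where scalar != "\u{FFFD}" {
        cleaned.append(scalar)
    }
    return String(cleaned)
}

let toEndBracket = "ACAAKgBVAFMAQQAqACAnlgAg2DwAIgB9AF0A"
print(lenientUTF16String(fromBase64: toEndBracket) ?? "not base64")
```

Note that this also removes any U+FFFD that was legitimately part of the original text, which is one more reason to fix the input instead.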

Rob Napier
  • THANKS SO MUCH FOR THIS!!! Works perfectly! I actually did play around with replacing stuff a couple of hours ago and I did play around back and forth with utf8 and 16 but I would've never thought of mixing those! Works perfectly, also gets serialized into JSON without issues (at least in Playground). However, I did take your concern seriously, and I am trying to clean up the input data now because like you say it's very fragile https://stackoverflow.com/questions/52526719/javascript-substring-without-splitting-emoji but for now this serves me perfectly, I am very thankful for this! – user2875404 Sep 26 '18 at 22:19
  • Best of luck. The answer recommending finding a library is a good answer. This is a really, really hard problem (you think flags are tricky, try decoding the 25 bytes in the UTF-8 encoding of ‍‍‍). I always laugh when people say stuff like "I don't care about Unicode. I don't need Thai or Urdu or Chinese. I just want English and Emoji." If you can handle Emoji, you've already handled most of the hardest parts. – Rob Napier Sep 27 '18 at 13:02