1

In my app, a TCP client handles a data stream coming from a remote TCP server. Everything works fine as long as the received characters are 1-byte characters. When the TCP server sends special characters such as "ü" (hex "c3bc", a 2-byte character), I start to experience issues.

This is the Swift 3 line of code that returns a nil String whenever the received data includes UTF-8 characters that take more than 1 byte:

let convertedString = String(bytes: data, encoding: String.Encoding.utf8)

Any idea how I could fix this? Basically, the incoming stream can include 1-byte or 2-byte characters encoded as UTF-8, and I need to convert the data stream into a String without issues.

Here is the whole portion of code where I'm experiencing the issue:

func startRead(for task: URLSessionStreamTask) {
    task.readData(ofMinLength: 1, maxLength: 65535, timeout: 300) { (data, eof, error) in
        if let data = data {
            NSLog("stream task read %@", data as NSData)

            let convertedString1 = String(data: data, encoding: String.Encoding(rawValue: String.Encoding.utf8.rawValue))

            if let convertedString = String(bytes: data, encoding: String.Encoding.utf8) {

                self.partialMessage = self.partialMessage + convertedString

                NSLog(convertedString)

                // Assign lengths (delimiter, MD5 digest, minimum expected length, message length)
                let delimiterLength = Constants.END_OF_MESSAGE_DELIMITER.lengthOfBytes(using: String.Encoding.utf8)
                let MD5Length = 32 // 32 characters -> hex representation of 16 bytes
                // 3 = CR+LF+1 char at least
                let minimumExpectedMessageLength = MD5Length + delimiterLength + 3
                let messageLength = self.partialMessage.lengthOfBytes(using: String.Encoding.utf8)

                // Check for delimiter and minimum expected message length (2 char msg + MD5 digest + delimiter)
                if (self.partialMessage.contains(Constants.END_OF_MESSAGE_DELIMITER)) &&
                    (messageLength >= minimumExpectedMessageLength) {

                    var message = self.partialMessage

                    // Get rid of optional CR+LF
                    var lowBound = message.index(message.endIndex, offsetBy: -1)
                    var hiBound = message.index(message.endIndex, offsetBy: 0)
                    var midRange = lowBound ..< hiBound

                    let optionalCRLF = message.substring(with: midRange)

                    if (optionalCRLF == "\r\n") || (optionalCRLF == "\0") {  // Remove CR+LF if present
                        lowBound = message.index(message.endIndex, offsetBy: -1)
                        hiBound = message.index(message.endIndex, offsetBy: 0)
                        midRange = lowBound ..< hiBound
                        message.removeSubrange(midRange)
                    }

                    // Check for delimiter proper position (has to be at the end)
                    lowBound = message.index(message.endIndex, offsetBy: -delimiterLength)
                    hiBound = message.index(message.endIndex, offsetBy: 0)
                    midRange = lowBound ..< hiBound

                    let delimiter = message.substring(with: midRange)

                    if (delimiter == Constants.END_OF_MESSAGE_DELIMITER)  // Delimiter in proper position?
                    {
                        // Acquire the MD digest
                        lowBound = message.index(message.endIndex, offsetBy: -(MD5Length+delimiterLength))
                        hiBound = message.index(message.endIndex, offsetBy: -(delimiterLength))
                        midRange = lowBound ..< hiBound
                        let receivedMD5 = message.substring(with: midRange)

                        // Acquire the deframed message (normalized message)
                        lowBound = message.index(message.startIndex, offsetBy: 0)
                        hiBound = message.index(message.endIndex, offsetBy: -(MD5Length+delimiterLength))
                        midRange = lowBound ..< hiBound
                        let normalizedMessage = message.substring(with: midRange)

                        // Calculate the MD5 digest on the normalized message
                        let calculatedMD5Digest = normalizedMessage.md5()

                        // Debug
                        print(delimiter)
                        print(normalizedMessage)
                        print(receivedMD5)
                        print(calculatedMD5Digest!)

                        // Check for the integrity of the data
                        if (receivedMD5.lowercased() == calculatedMD5Digest?.lowercased()) || self.noMD5Check  // TEMPORARY
                        {
                            if (normalizedMessage == "Unauthorized Access")
                            {
                                // Update the authorization status
                                self.authorized = false

                                // Stop the refresh control
                                if let refreshControl = self.refreshControl {
                                    if refreshControl.isRefreshing {
                                        refreshControl.endRefreshing()
                                    }
                                }

                                // Stop the stream
                                NSLog("stream task stop")
                                self.stop(task: task)

                                // Shows an alert
                                self.showAlert(title: NSLocalizedString("Unauthorized Access", comment: "Unauthorized Access Title"), message: NSLocalizedString("Please login with the proper Username and Password before to send any command!", comment: "Unauthorized Access Message"))                                    
                            }
                            else if (normalizedMessage == "System Busy")
                            {
                                // Stop the refresh control
                                if let refreshControl = self.refreshControl {
                                    if refreshControl.isRefreshing {
                                        refreshControl.endRefreshing()
                                    }
                                }

                                // Stop the stream
                                NSLog("stream task stop")
                                self.stop(task: task)

                                // Shows an alert
                                self.showAlert(title: NSLocalizedString("System Busy", comment: "System Busy Title"), message: NSLocalizedString("The system is busy at the moment. Only one connection at a time is allowed!", comment: "System Busy Message"))
                            }
                            else if (normalizedMessage == "Error")
                            {
                                // Stop the refresh control
                                if let refreshControl = self.refreshControl {
                                    if refreshControl.isRefreshing {
                                        refreshControl.endRefreshing()
                                    }
                                }

                                // Stop the stream
                                NSLog("stream task stop")
                                self.stop(task: task)

                                // Shows an alert
                                self.showAlert(title: NSLocalizedString("Error", comment: "Error Title"), message: NSLocalizedString("An error occurred during the execution of the command!", comment: "Command Error Message"))
                            }
                            else if (normalizedMessage == "ErrorMachineRunning")
                            {
                                // Stop the refresh control
                                if let refreshControl = self.refreshControl {
                                    if refreshControl.isRefreshing {
                                        refreshControl.endRefreshing()
                                    }
                                }

                                // Stop the stream
                                NSLog("stream task stop")
                                self.stop(task: task)

                                // Shows an alert
                                self.showAlert(title: NSLocalizedString("Error", comment: "Error Title"), message: NSLocalizedString("The command cannot be executed while the machine is running", comment: "Machine Running Message 1")+"!\r\n\n "+NSLocalizedString("Trying to execute any command in this state could be dangerous for both people and machinery", comment: "Machine Running Message 2")+".\r\n\n "+NSLocalizedString("Please stop the machine and leave the automatic or semi-automatic modes before to provide any command", comment: "Machine Running Message 3")+".")
                            }
                            else if (normalizedMessage == "Command Not Recognized")
                            {
                                // Stop the refresh control
                                if let refreshControl = self.refreshControl {
                                    if refreshControl.isRefreshing {
                                        refreshControl.endRefreshing()
                                    }
                                }

                                // Stop the stream
                                NSLog("stream task stop")
                                self.stop(task: task)

                                // Shows an alert
                                self.showAlert(title: NSLocalizedString("Error", comment: "Error Title"), message: NSLocalizedString("Command not recognized!", comment: "Command Unrecognized Message"))
                            }
                            else
                            {
                                // Stop the refresh control
                                if let refreshControl = self.refreshControl {
                                    if refreshControl.isRefreshing {
                                        refreshControl.endRefreshing()
                                    }
                                }

                                // Stop the stream
                                NSLog("stream task stop")
                                self.stop(task: task)

                                //let testMessage = "test\r\nf3ea0b9bff4a2c79e60acf6873f4a1ce</EOM>\r\n"
                                //normalizedMessage = testMessage

                                // Process the received csv file
                                self.processCsvData(file: normalizedMessage)
                            }
                        }
                        else
                        {
                            // Stop the refresh control
                            if let refreshControl = self.refreshControl {
                                if refreshControl.isRefreshing {
                                    refreshControl.endRefreshing()
                                }
                            }

                            // Stop the stream
                            NSLog("stream task stop")
                            self.stop(task: task)

                            // Shows an alert
                            self.showAlert(title: NSLocalizedString("Data Error", comment: "Data Error Title"), message: NSLocalizedString("The received data cannot be read since it's corrupted or incomplete!", comment: "Data Error Message"))
                        }

                    }
                    else
                    {
                        // Stop the refresh control
                        if let refreshControl = self.refreshControl {
                            if refreshControl.isRefreshing {
                                refreshControl.endRefreshing()
                            }
                        }

                        // Stop the stream
                        NSLog("stream task stop")
                        self.stop(task: task)

                        // Shows an alert
                        self.showAlert(title: NSLocalizedString("Data Error", comment: "Data Error Title"), message: NSLocalizedString("The received data cannot be read since it's corrupted or incomplete!", comment: "Data Error Message"))
                    }
                }
            }
        }
        if eof {
            // Stop the refresh control
            if let refreshControl = self.refreshControl {
                if refreshControl.isRefreshing {
                    refreshControl.endRefreshing()
                }
            }

            // Refresh the tableview content
            self.tableView.reloadData()

            // Stop the stream
            NSLog("stream task end")
            self.stop(task: task)

        } else if error == nil {
            self.startRead(for: task)
        } else {
            // We ignore the error because we'll see it again in `didCompleteWithError`.
            NSLog("stream task read error")
        }
    }
}
Salva
  • Is `data` the complete string data or just data for part of the string? – rmaddy Dec 15 '16 at 23:38
  • data is coming from the function that handles the receive buffer, so the bytes do not necessarily all arrive at the same time. I use the converted string to build up the whole message, and as soon as I detect the termination character (I implemented a simple protocol) I close the socket and analyze the received message. As soon as a character represented by more than 1 byte arrives in the buffer, reception stalls because convertedString can no longer be created (I get nil). – Salva Dec 15 '16 at 23:49
  • See my answer which explains the flaw in your approach and what you must do. – rmaddy Dec 15 '16 at 23:53
  • Thank you rmaddy, I've added a more complete portion of code to show you how the data is received. As you can see, I use the data I'm receiving to build a partial message until I get the "end of message" delimiter, which I use to close the socket and start the analysis. I'm not that experienced with data streams, so do you have any suggestions about how I could modify my function to receive the whole bunch of bytes before converting to a string? At the moment I need to look for the delimiter as a string to mark the end of the message reception. – Salva Dec 16 '16 at 00:07
  • Normally you should start your data with a byte count. Say the first 4 bytes represent a 32-bit integer in some agreed upon "endianness". You read those 4 bytes to get the length. Then you read data until you get that many more bytes. Then you know you are at the end of the message. The problem with trying to use an "end of message" marker at the end of your data is that the "end of message" marker could be split across reads. Either way, you need to refactor your code to process at the data level and not make any attempt to convert the data to a string until all of the string data is read. – rmaddy Dec 16 '16 at 00:15
  • Thank you very much for your support! I'll try to implement what you suggest.. – Salva Dec 16 '16 at 00:18
  • I believe this is a duplicate of http://stackoverflow.com/questions/34595070/what-is-a-safe-way-to-turn-streamed-utf8-data-into-a-string/34595661#34595661, but I'm not so certain that I'm ready to pull out the dupe-hammer (when I vote dupe, it closes). Have I missed anything, or is this just a dupe? – Rob Napier Dec 16 '16 at 01:21
  • Thanks for your feedback Rob Napier. It's quite a similar topic indeed (I missed that, and I'm also checking that question), but in this case the conversation seems to be moving toward the two possible strategies for correctly decoding multi-byte characters from a UTF-8 data stream in Swift 3: "number of bytes to be transmitted in the header of the message" vs "end-of-message delimiter at the end of the message". Thanks! – Salva Dec 16 '16 at 06:25

2 Answers

2

It's critical that data represents the data for the entire string, not just a substring. If you are attempting to convert substrings from partial data of the entire string, it will fail in many cases.

It works with 1-byte characters because no matter where you chop the data stream, the partial data still represents a valid string. But once you start dealing with multi-byte characters, a partial data stream could easily result in the first or last byte of the data being only part of a multi-byte character. This prevents the data from being interpreted properly.
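
To make the failure concrete, here is a minimal sketch (assuming a recent Swift toolchain with Foundation; the sample word and the split position are made up for illustration). It cuts the UTF-8 bytes of a short string in the middle of "ü" and shows that each chunk fails to decode on its own, while the concatenated bytes decode fine.

```swift
import Foundation

// "ü" is U+00FC, which UTF-8 encodes as the two bytes 0xC3 0xBC.
let whole = Data("gr\u{FC}n".utf8)           // 67 72 c3 bc 6e
let firstChunk = Data(whole.prefix(3))       // 67 72 c3  (cut inside "ü")
let secondChunk = Data(whole.dropFirst(3))   // bc 6e

// Neither chunk is valid UTF-8 by itself, so both conversions fail.
let decodedFirst = String(data: firstChunk, encoding: .utf8)
let decodedSecond = String(data: secondChunk, encoding: .utf8)

// Concatenating the raw bytes first and decoding once succeeds.
var buffer = Data()
buffer.append(firstChunk)
buffer.append(secondChunk)
let decodedWhole = String(data: buffer, encoding: .utf8)
```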

So you must ensure that you build up a data object with all of the bytes of a given string before attempting to convert the data into a string.

Normally you should start your data with a byte count. Say the first 4 bytes represent a 32-bit integer in some agreed upon "endianness". You read those 4 bytes to get the length. Then you read data until you get that many more bytes. Then you know you are at the end of the message.
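
A rough sketch of that idea, assuming a 4-byte big-endian length header followed by the UTF-8 payload (the framing and the name `nextMessage(from:)` are hypothetical; the real protocol has to be agreed with the server). Each chunk read from the stream is appended to a `Data` buffer, and the function is called until it returns nil.

```swift
import Foundation

/// Removes and returns one complete message from `buffer`,
/// or returns nil if more bytes are still needed.
/// Assumed framing: 4-byte big-endian length, then that many UTF-8 bytes.
func nextMessage(from buffer: inout Data) -> String? {
    let headerSize = 4
    guard buffer.count >= headerSize else { return nil }

    // Assemble the length byte by byte so the result does not depend
    // on the host's endianness or on memory alignment.
    var length = 0
    for byte in buffer.prefix(headerSize) {
        length = (length << 8) | Int(byte)
    }
    guard buffer.count >= headerSize + length else { return nil }

    // The payload is complete, so it is now safe to decode it.
    // Note: String(decoding:as:) replaces invalid bytes with U+FFFD instead of failing.
    let payload = buffer.dropFirst(headerSize).prefix(length)
    let message = String(decoding: payload, as: UTF8.self)

    // Keep whatever follows (the start of the next message, if any).
    buffer = Data(buffer.dropFirst(headerSize + length))
    return message
}
```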

The problem with trying to use an "end of message" marker at the end of your data is that the "end of message" marker could be split across reads. Either way, you need to refactor your code to process at the data level and not make any attempt to convert the data to a string until all of the string data is read.
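
If the protocol has to keep an end-of-message marker, working at the data level could look roughly like this. It is only a sketch: `MessageAssembler` is a hypothetical helper, and the `"</EOM>"` delimiter is taken from the commented-out test message in the question. Because the search runs over the whole accumulated buffer, a marker that arrives split across two reads is still found once its last byte is in.

```swift
import Foundation

/// Accumulates raw chunks and decodes only once the delimiter has been seen.
struct MessageAssembler {
    private var buffer = Data()
    private let delimiter: Data

    init(delimiter: String) {
        self.delimiter = Data(delimiter.utf8)
    }

    /// Appends a chunk; returns the text before the delimiter once it is complete,
    /// or nil while the message is still partial.
    mutating func append(_ chunk: Data) -> String? {
        buffer.append(chunk)
        // Search the accumulated bytes, not just the newest chunk.
        guard let range = buffer.range(of: delimiter) else { return nil }
        let body = buffer[buffer.startIndex..<range.lowerBound]
        let text = String(decoding: body, as: UTF8.self)
        // Drop the consumed message and its delimiter, keep the rest.
        buffer = Data(buffer[range.upperBound...])
        return text
    }
}
```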

rmaddy
0

As you know, a single UTF-8 character is encoded in 1, 2, 3 or 4 bytes. In your case, you need to handle 1- or 2-byte characters, and the byte sequence you receive may not be aligned to a character boundary. However, as rmaddy pointed out, the byte sequence passed to String.Encoding.utf8 must start and end on a character boundary.

Now, there are two options to handle this situation. One is, as rmaddy suggests, to send the length first and count the incoming data bytes. The drawback is that you have to modify the transmitting (server) side as well, which may not be possible.

Another option is to scan the incoming sequence byte by byte, keep track of the character boundaries, and build up a legitimate UTF-8 byte sequence. Fortunately, UTF-8 is designed so that you can identify where the character boundaries are by looking at ANY byte in the stream. Specifically, the first byte of a 1-, 2-, 3- or 4-byte UTF-8 character has the bit pattern 0xxxxxxx, 110xxxxx, 1110xxxx or 11110xxx respectively, and the second through fourth bytes all have the pattern 10xxxxxx. This makes your life a lot easier.
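
As an illustration of the bit patterns above, the following sketch finds how many leading bytes of a buffer end on a character boundary, so that only those bytes are decoded and a trailing partial character waits for the next read. The function name `completePrefixLength(of:)` is made up, and the function only looks at the tail of the buffer; it does not validate the rest.

```swift
import Foundation

/// Returns how many leading bytes of `data` end on a UTF-8 character boundary.
/// Only a trailing, incomplete character is excluded.
func completePrefixLength(of data: Data) -> Int {
    let bytes = [UInt8](data)
    var index = bytes.count
    var continuationCount = 0

    // Walk back over at most three continuation bytes (10xxxxxx).
    while index > 0 && continuationCount < 3 && (bytes[index - 1] & 0xC0) == 0x80 {
        index -= 1
        continuationCount += 1
    }
    // Empty buffer, or nothing but continuation bytes: leave it to the decoder.
    guard index > 0 else { return bytes.count }

    // Work out how long the last character claims to be from its lead byte.
    let lead = bytes[index - 1]
    let expectedLength: Int
    if (lead & 0x80) == 0x00 { expectedLength = 1 }        // 0xxxxxxx
    else if (lead & 0xE0) == 0xC0 { expectedLength = 2 }   // 110xxxxx
    else if (lead & 0xF0) == 0xE0 { expectedLength = 3 }   // 1110xxxx
    else if (lead & 0xF8) == 0xF0 { expectedLength = 4 }   // 11110xxx
    else { return bytes.count }                            // malformed, not merely partial

    // If the last character is still incomplete, stop right before its lead byte.
    let receivedLength = continuationCount + 1
    return receivedLength < expectedLength ? index - 1 : bytes.count
}
```

The caller would decode `data.prefix(completePrefixLength(of: data))` and carry the remaining bytes over to the next read.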

If you pick your "end of message" marker from the 1-byte UTF-8 characters, you can detect EOM easily and reliably without considering byte sequences, since it is a single byte and never appears inside a 2- to 4-byte character.
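
A small sketch of the single-byte marker idea, assuming 0x03 (ASCII End of Text) as a hypothetical marker that the server appends to every message; `takeMessage(from:)` is likewise an illustrative name. Since every byte of a multi-byte UTF-8 character has its high bit set, a plain byte search cannot produce a false match inside such a character.

```swift
import Foundation

// Hypothetical marker: ASCII End of Text.
let endOfMessage: UInt8 = 0x03

/// Splits `buffer` at the first marker byte.
/// Returns nil until the marker has arrived.
func takeMessage(from buffer: inout Data) -> String? {
    guard let end = buffer.firstIndex(of: endOfMessage) else { return nil }
    // Everything before the marker is a complete message, so decoding is safe.
    let text = String(decoding: buffer[buffer.startIndex..<end], as: UTF8.self)
    // Keep the bytes after the marker for the next message.
    buffer = Data(buffer[buffer.index(after: end)...])
    return text
}
```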

beshio
  • Thank you beshio... as a matter of fact, that's what I was thinking about (the work I'd have to do adapting both the server and the client to change how I handle message boundaries). Your solution seems very interesting and I'll evaluate it. My end-of-message delimiter is actually the following string: "</EOM>" (kind of XML-like). It's still a bunch of 1-byte characters, so your suggestion seems to be applicable. I guess I need to scan the incoming data stream for groups of 6 bytes to identify the delimiter and create the whole string at the very end of the process. – Salva Dec 16 '16 at 06:33
  • As long as you detect your EOM at the "receive buffer" level, meaning before converting to a Swift String, I think your six-byte EOM sequence approach should work, since all six characters are 1-byte UTF-8 and cannot be confused with anything else, even if the sequence is interrupted on the way, as long as you manage the receive buffer properly. – beshio Dec 16 '16 at 14:20
  • However, a single-byte EOM is stateless and handy, by which I mean we can terminate immediately without worrying about the "future/past". You can use both EOMs if you want, but that depends very much on your application and on the reliability of your server-client connection. For example, you can use a single-byte EOM such as 0x03 (End Of Text char) as a "low level protocol event" to close the TCP session ANYWAY, while keeping "</EOM>" as a regular "higher level protocol" event and gracefully doing some work before closing the TCP session. – beshio Dec 16 '16 at 14:20
  • Hi beshio, it sounds great, thank you. I didn't think about the possibility to use as a delimiter directly the escape character 0x03. I'll try what you suggest asap. Thank you again for the great suggestions! ;-) – Salva Dec 16 '16 at 18:27
  • I hope it works. And I would appreciate if you could accept my answer if it is helpful. Thanks ! – beshio Dec 17 '16 at 12:28
  • Sorry I'm new here and I still have to figure out how it works.. ;-) done! – Salva Dec 17 '16 at 14:02