
I'm trying to split the contents of a file into rows like this:

let path = "/Users/user/Downloads/history.csv"

do {
    let contents = try NSString(contentsOfFile: path, encoding: String.Encoding.utf8.rawValue)
    let rows = contents.components(separatedBy: "\n")

    print("contents: \(contents)")
    print("rows: \(rows)")
} catch {
    // Report read failures instead of silently ignoring them.
    print("error: \(error)")
}

I have two files that look almost identical. For the first file the output looks like this:

Output File1:

contents: 2017-07-31 16:29:53,0.10109999,9.74414271,0.98513273,0.15%,42302999779,-0.98513273,9.72952650
2017-07-31 16:29:53,0.10109999,0.25585729,0.02586716,0.25%,42302999779,-0.02586716,0.25521765


rows: ["2017-07-31 16:29:53,0.10109999,9.74414271,0.98513273,0.15%,42302999779,-0.98513273,9.72952650", "2017-07-31 16:29:53,0.10109999,0.25585729,0.02586716,0.25%,42302999779,-0.02586716,0.25521765", "", ""]

Output File2:

contents: 40.75013313,0.00064825,5/18/2017 7:17:01 PM

19.04004820,0.00059900,5/19/2017 9:17:03 PM

rows: ["4\00\0.\07\05\00\01\03\03\01\03\0,\00\0.\00\00\00\06\04\08\02\05\0,\05\0/\01\08\0/\02\00\01\07\0 \07\0:\01\07\0:\00\01\0 \0P\0M\0", "\0", "1\09\0.\00\04\00\00\04\08\02\00\0,\00\0.\00\00\00\05\09\09\00\00\0,\0\05\0/\01\09\0/\02\00\01\07\0 \09\0:\01\07\0:\00\03\0 \0P\0M\0", "\0", "\0", "\0"]

Both files can be read as a string, since print(contents) works. But as soon as the string is split, the second file is no longer readable. I tried different encodings, but nothing worked. Does anyone have an idea how to make the string from the second file stay readable after splitting?

Saurabh Jain
Josch Hazard
  • 1
    It must be related to encoding. Could you upload your raw csv files somewhere ? – nathan Aug 13 '17 at 18:41
  • 2
    Is this related to your previous, now deleted question https://stackoverflow.com/questions/45662712/problems-with-csv-file-type ? – Did you try `CSVReader(stream: stream, codecType: UTF16.self, endian: .big/.little)` as I suggested? – Martin R Aug 13 '17 at 18:43
  • 1
    See https://stackoverflow.com/questions/18851558/ios-whats-the-best-way-to-detect-a-files-encoding for detecting the encoding automatically. – Martin R Aug 13 '17 at 18:45
  • @MartinR Yes, I tried, but the result was the same: `["㜀挀㜀㔀㜀㠀昀㄀ⴀ搀`. I'm now trying without CSVReader, that's why I deleted the previous question. It's also closer to the basic problem now, I guess. – Josch Hazard Aug 13 '17 at 18:53
  • 1
    @JoschHazard: I have viewed your file with a hex editor. It seems that each line uses UTF-16 big-endian encoding. But the lines are separated by `0A 00 0A` instead of `0A 00` or `0A 00 0A 00`. I would consider the file broken. – Martin R Aug 13 '17 at 19:00
  • @MartinR Ok, thank you. I modified the file a little bit to make it shorter. [Here](https://www.file-upload.net/download-12658139/fullOrders4.csv.html) is the original file. – Josch Hazard Aug 13 '17 at 19:05
  • @MartinR I don't know if there is too much advertising on the last link. [Here](https://ufile.io/j79ep) it is one more time. – Josch Hazard Aug 13 '17 at 19:08
  • That seems to be UTF-16, little-endian. – Martin R Aug 13 '17 at 19:14

1 Answer


Your file is apparently UTF-16 (little-endian) encoded:

$ hexdump fullorders4.csv 
0000000 4f 00 72 00 64 00 65 00 72 00 55 00 75 00 69 00
0000010 64 00 2c 00 45 00 78 00 63 00 68 00 61 00 6e 00
0000020 67 00 65 00 2c 00 54 00 79 00 70 00 65 00 2c 00
0000030 51 00 75 00 61 00 6e 00 74 00 69 00 74 00 79 00
...

In UTF-16 little-endian, each ASCII character is stored as two bytes: the first byte is the ASCII code, and the second byte is zero.

If the file is read as UTF-8, each zero byte is decoded as an ASCII NUL character. That is what shows up as \0 in the output.
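
The effect can be reproduced without the original file. The following is only a minimal sketch: the `sample` line is made up for illustration, and the bytes are built in memory rather than read from disk.

```swift
import Foundation

// Hypothetical sample line, encoded the way the second file stores its text
// (UTF-16 little-endian, no byte order mark).
let sample = "40.75,PM\n"
let utf16Bytes = sample.data(using: .utf16LittleEndian)!

// Decoding those bytes as UTF-8 succeeds, but every zero byte becomes a NUL character.
let misread = String(data: utf16Bytes, encoding: .utf8)!
// Decoding with the matching encoding gives back the original text.
let correct = String(data: utf16Bytes, encoding: .utf16LittleEndian)!

let nulCount = misread.unicodeScalars.filter { $0.value == 0 }.count
print("NUL characters in misread string: \(nulCount)")
print("correct == sample: \(correct == sample)")
```

Printing `misread` can still look fine in a terminal because NUL is invisible there, which would explain why the contents appear readable until the array of rows is printed with its escapes.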

Therefore specifying the encoding as utf16LittleEndian works in your case:

let contents = try NSString(contentsOfFile: path, encoding: String.Encoding.utf16LittleEndian.rawValue)
// or:
let contents = try String(contentsOfFile: path, encoding: .utf16LittleEndian)

There is also a method that tries to detect the encoding used (compare "iOS: What's the best way to detect a file's encoding", https://stackoverflow.com/questions/18851558/ios-whats-the-best-way-to-detect-a-files-encoding). In Swift that would be:

var enc: UInt = 0
let contents = try NSString(contentsOfFile: path, usedEncoding: &enc)
// or:
var enc = String.Encoding.ascii
let contents = try String(contentsOfFile: path, usedEncoding: &enc)

However, in your particular case, that would read the file as UTF-8 again, because its bytes also form valid UTF-8. Prepending a byte order mark (BOM) to the file (FF FE for UTF-16 little-endian) would solve that problem reliably.
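
To illustrate what the BOM buys you, here is a rough sketch that inspects the leading bytes by hand instead of relying on Foundation's detection. The helper `decodeText(_:fallback:)` and the sample data are hypothetical and not part of any API; UTF-32 BOMs are deliberately not handled.

```swift
import Foundation

/// Hypothetical helper: choose the encoding from a leading BOM,
/// otherwise decode with the caller-supplied `fallback`.
func decodeText(_ data: Data, fallback: String.Encoding) -> String? {
    let prefix = [UInt8](data.prefix(3))
    if prefix.starts(with: [0xFF, 0xFE]) {        // UTF-16 little-endian BOM
        return String(data: Data(data.dropFirst(2)), encoding: .utf16LittleEndian)
    }
    if prefix.starts(with: [0xFE, 0xFF]) {        // UTF-16 big-endian BOM
        return String(data: Data(data.dropFirst(2)), encoding: .utf16BigEndian)
    }
    if prefix.starts(with: [0xEF, 0xBB, 0xBF]) {  // UTF-8 BOM
        return String(data: Data(data.dropFirst(3)), encoding: .utf8)
    }
    return String(data: data, encoding: fallback)
}

// Made-up sample row, once without and once with a prepended FF FE.
let body = "19.04,PM\n".data(using: .utf16LittleEndian)!
let withBOM = Data([0xFF, 0xFE]) + body

let withoutMark = decodeText(body, fallback: .utf8)   // falls back to UTF-8, keeps the NULs
let withMark = decodeText(withBOM, fallback: .utf8)   // recognised as UTF-16 little-endian
print(withMark == "19.04,PM\n")
```

Without the mark there is nothing in the bytes that rules out UTF-8, so any detection has to guess. With the mark the choice is unambiguous.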

Martin R