
I need to write a very fast character-by-character reader in Swift. This is my solution so far.

For a 1.4 MB file it takes 0m0.932s. For a 150 MB file it takes 1m42.931s.

Do you know a faster solution?

import Foundation
class CharReader {

let encoding : String.Encoding
let chunkSize : Int
var fileHandle : FileHandle!
let buffer : NSMutableData!
var atEof : Bool = false
var characterPointer: UnsafeMutablePointer<Character>
var startPointer: UnsafeMutablePointer<Character>

var stored_cnt: Int = 0;
var stored_idx: Int = 0;

init?(path: String, encoding: String.Encoding = String.Encoding.utf8, chunkSize : Int = 1024) {
    self.chunkSize = chunkSize
    self.encoding = encoding
    characterPointer = UnsafeMutablePointer<Character>.allocate(capacity: chunkSize)
    startPointer = characterPointer
    if let fileHandle = FileHandle(forReadingAtPath: path),
        let buffer = NSMutableData(capacity: chunkSize){
        self.fileHandle = fileHandle
        self.buffer = buffer
    } else {
        self.fileHandle = nil
        self.buffer = nil
        return nil
    }
}

deinit {
    self.close()
}

func nextChar() -> Character? {

    if atEof {
        return nil
    }

    if stored_cnt > (stored_idx + 1) {
        stored_idx += 1
        let char = characterPointer.pointee
        characterPointer = characterPointer.successor()
        return char
    }

    let tmpData = fileHandle.readData(ofLength: (chunkSize))
    if tmpData.count == 0 {
        atEof = true
        return nil
    }

    if let s = NSString(data: tmpData, encoding: encoding.rawValue) as String! {
        stored_idx = 0
        let characters = s.characters
        stored_cnt = characters.count

        characterPointer = startPointer
        characterPointer.initialize(from: characters)

        let char = characterPointer.pointee
        characterPointer = characterPointer.successor()
        return char
    }
    return nil;
}


/// Close the underlying file. No reading must be done after calling this method.
func close() -> Void {
    fileHandle?.closeFile()
    fileHandle = nil
}

}

Please let me know.

I test the class with this main.swift:

import Foundation

if CommandLine.arguments.count < 2 {
    print("Too few arguments.")
    exit(0)
}
let file = CommandLine.arguments[1]

if let aCharReader = CharReader(path: file) {
    defer {
        aCharReader.close()
    }
    while let char = aCharReader.nextChar() {
        continue
    }
}

The project is on GitHub: https://github.com/petershaw/charsinfile

Thanks a lot, ps

Peter Shaw
  • How large is your file? Can you just read it into a string completely? (By the way, the code looks vaguely familiar; is it derived from http://stackoverflow.com/a/24648951/1187415?) – Martin R Nov 06 '16 at 14:11
  • It depends: from a few bytes to multiple gigabytes. – Peter Shaw Nov 06 '16 at 14:13
  • Is it plain ASCII or does it contain arbitrary Unicode characters? – Martin R Nov 06 '16 at 14:14
  • The above isn't correct for UTF-8. It could end a chunk in the middle of a character, and nextChar() would return nil when `NSString` fails to decode. – Rob Napier Nov 06 '16 at 14:22
  • The "ASCII vs UTF-8" question is going to be the heart of designing a high-speed reader. If it's ASCII, avoiding UTF-8 complexity is a very large performance win. If it's UTF-8, you almost certainly want to do it with UTF8.decode rather than NSString. And if you're allowed to pre-process the file from UTF-8 to something with more reliable boundaries like UTF-16, then that's a win, too. Knowing the constraints is critical; there is no single answer that is fastest for all versions of this problem. – Rob Napier Nov 06 '16 at 14:27
  • Hi Rob, hi Martin. Well, I have all kinds of characters in my files, even nasty emoticons. That is why I parse the chunk into characters first. My first guess was to store them in an array and then return the next item from the array. This was way too slow. I rewrote the code so that a pointer to the characters is used. Maybe you can find more optimisations to reduce the read time. – Peter Shaw Nov 06 '16 at 14:51
  • Martin R, the original line-by-line code comes from that thread, but I need it char by char, so only some of the variable names remain ;) – Peter Shaw Nov 06 '16 at 14:54
  • Do you need a solution for arbitrary text encodings or do you know it to be UTF-8? – Martin R Nov 06 '16 at 14:57
  • I know it is always UTF-8 – Peter Shaw Nov 06 '16 at 14:58
  • Have a look at http://stackoverflow.com/a/34595661/1187415. – Martin R Nov 06 '16 at 15:04
  • Yep, your solution is a lot faster: 0.441s vs 0.906s on a 1.4 MB file, and 44.699s vs 1m35.474s on a 150 MB file. Thanks a lot, I will study the code and check a few text examples of mine. Thanks a lot, Martin. – Peter Shaw Nov 06 '16 at 15:24
  • That solution is from @RobNapier, not from me :) – Martin R Nov 06 '16 at 15:29
  • Oops, sorry. And thanks. I get some differences between the input file and the result. The mess starts with emoticons. ;( Hm. – Peter Shaw Nov 06 '16 at 15:31
  • I updated the repository and documented the test against a plain base64 file: https://github.com/petershaw/charsinfile/blob/master/README.md#testing-the-different-solutions - my solution is very slow (~17m vs. ~26m), but my diff against the source is clean, while @RobNapier's solution differs a lot from the source. Am I missing something here? – Peter Shaw Nov 06 '16 at 16:23
  • `print()` is probably taking the vast majority of the time here. It's very slow to print one character at a time using `print()`. – Rob Napier Nov 06 '16 at 16:33
  • OK, deal. But why does yours print out a different file? Please see my example with a base64 input; it is described in the readme on GitHub. Faster, yes, but not accurate, and I cannot figure out why. – Peter Shaw Nov 06 '16 at 16:47
  • @RobNapier: It should be `stream.read(&buffer, maxLength: buffer.count)`, not `buffer.capacity`. – Martin R Nov 06 '16 at 17:00
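
The chunk-boundary problem Rob Napier describes in the comments can be reproduced without a file. This is a minimal sketch, assuming a six-byte sample (`a`, one four-byte emoji, `b`) and a split position chosen purely for illustration; the byte values and variable names are not taken from the code in the question.

```swift
import Foundation

// UTF-8 bytes for "a", U+1F600 (a four-byte emoji), and "b".
let sampleBytes: [UInt8] = [0x61, 0xF0, 0x9F, 0x98, 0x80, 0x62]

// Simulate a chunk size of 3: the cut lands in the middle of the emoji.
let firstChunk = Data(sampleBytes[0..<3])
let secondChunk = Data(sampleBytes[3...])

// Strict UTF-8 decoding rejects both fragments, although the whole input is valid.
let decodedFirst = String(data: firstChunk, encoding: .utf8)
let decodedSecond = String(data: secondChunk, encoding: .utf8)
let decodedWhole = String(data: Data(sampleBytes), encoding: .utf8)

print(decodedFirst == nil, decodedSecond == nil, decodedWhole != nil)
```

With a reader that decodes each chunk on its own, a failed decode like this is indistinguishable from the end of the file, so reading stops early.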

1 Answer


I updated the repository with both versions in it: https://github.com/petershaw/charsinfile

With help from Martin I fixed the mistake in Rob's code.

I tested a bunch of different files and both versions worked fine. Rob Napier's code is more efficient! Thanks a lot, Rob.
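
For reference, the stream-based approach can be sketched roughly as follows. This is not the code from the repository or from Rob's answer verbatim: the names `ByteIterator`, `ScalarReader`, and `readScalars`, the chunk sizes, and the temporary file name are assumptions made for illustration. It also yields Unicode scalars rather than `Character` values, so an emoji made of several scalars arrives in pieces.

```swift
import Foundation

/// Hands out bytes from an InputStream one at a time, refilling a fixed-size buffer.
struct ByteIterator: IteratorProtocol {
    private let stream: InputStream
    private var buffer: [UInt8]
    private var count = 0   // number of valid bytes currently in `buffer`
    private var index = 0   // position of the next byte to hand out

    init(stream: InputStream, chunkSize: Int = 64 * 1024) {
        self.stream = stream
        self.buffer = [UInt8](repeating: 0, count: chunkSize)
    }

    mutating func next() -> UInt8? {
        if index == count {
            // Pass buffer.count, not buffer.capacity: capacity may exceed the
            // number of elements the array actually holds.
            count = stream.read(&buffer, maxLength: buffer.count)
            index = 0
            if count <= 0 {   // 0 means end of stream, a negative value means error
                count = 0
                return nil
            }
        }
        defer { index += 1 }
        return buffer[index]
    }
}

/// Decodes UTF-8 incrementally, so a scalar split across two reads is still decoded.
struct ScalarReader: IteratorProtocol {
    private var bytes: ByteIterator
    private var codec = UTF8()

    init(bytes: ByteIterator) { self.bytes = bytes }

    mutating func next() -> Unicode.Scalar? {
        let replacement: Unicode.Scalar = "\u{FFFD}"
        switch codec.decode(&bytes) {
        case .scalarValue(let scalar): return scalar
        case .emptyInput: return nil
        case .error: return replacement   // invalid byte sequence in the input
        }
    }
}

/// Hypothetical helper: collects all scalars of a file (fine for tests, not for huge files).
func readScalars(atPath path: String, chunkSize: Int) -> [Unicode.Scalar] {
    guard let stream = InputStream(fileAtPath: path) else { return [] }
    stream.open()
    defer { stream.close() }
    var reader = ScalarReader(bytes: ByteIterator(stream: stream, chunkSize: chunkSize))
    var result: [Unicode.Scalar] = []
    while let scalar = reader.next() { result.append(scalar) }
    return result
}

// Usage: a chunk size of 4 forces the four-byte emoji to straddle two reads.
let path = NSTemporaryDirectory() + "charreader-sketch.txt"
let sample: [UInt8] = [0x61, 0xF0, 0x9F, 0x98, 0x80, 0x62]
_ = FileManager.default.createFile(atPath: path, contents: Data(sample))
let scalars = readScalars(atPath: path, chunkSize: 4)
print(scalars.count)
```

Because the decoder pulls bytes through the iterator instead of decoding each chunk on its own, the chunk size no longer affects correctness, only the number of `read` calls.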

Thanks to all of you for helping me figure out the fastest solution. It's great to have such a wonderful and polite community for Swift and Cocoa related stuff here on SO.

Have a great week!

ps

Peter Shaw
  • I am just curious: Here you say that variant "b" (using the StreamGenerator/UnicodeScalarGenerator from Rob's code) is the fastest. But your recent question http://stackoverflow.com/q/43772575/1187415 refers to variant "a" (using your CharReader). Were there special reasons to continue working on your variant instead of using the faster one? – Martin R May 04 '17 at 11:41