Background info: I am teaching myself concurrent programming in Python. To do this, I am implementing a version of grep that splits the task of searching into work units to be executed on separate cores.
I noticed in this question that grep is able to search quickly thanks to several optimisations, a key one being that it avoids reading every byte of its input files. For example, the input is read into one large buffer rather than being split up based on where newlines are found.
I would like to try splitting large input files into smaller work units, but without reading every byte to find newlines or anything similar to determine split points. My plan is to split the input in half (the splits simply being byte offsets), then split those halves into halves, continuing until the pieces are of a manageable (possibly predetermined) size, as in the sketch below. Naturally, to do this you need to know the size of your input.
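To show what I mean, here is a minimal sketch of the splitting scheme, assuming the chunk threshold and the input path are placeholders of my own choosing:

    import os

    # Placeholder threshold; a real value would be tuned per machine.
    MAX_CHUNK = 1 << 20  # 1 MiB

    def split_offsets(start, end, max_chunk=MAX_CHUNK):
        """Recursively halve the byte range [start, end) until each
        piece is at most max_chunk bytes. No file contents are read;
        only offsets are computed."""
        if end - start <= max_chunk:
            return [(start, end)]
        mid = (start + end) // 2
        return (split_offsets(start, mid, max_chunk)
                + split_offsets(mid, end, max_chunk))

    path = "big_input.txt"  # hypothetical input file
    size = os.path.getsize(path)  # file size in bytes, no reading needed
    work_units = split_offsets(0, size)

Each `(start, end)` pair could then be handed to a separate worker process.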
The Question: is it possible to calculate or estimate the number of characters in a plain text file if the file's size and encoding are both known?
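For what it's worth, this is the kind of calculation I have in mind; the helper name is my own, and I can only see how to get bounds (not an exact count) for a variable-width encoding like UTF-8:

    import os

    def estimate_char_count(path, encoding):
        """Hypothetical helper: derive a character count (or bounds)
        from the file size alone, without reading the file."""
        size = os.path.getsize(path)
        if encoding in ("ascii", "latin-1"):
            return size  # fixed width: exactly 1 byte per character
        if encoding == "utf-32":
            return size // 4  # fixed width: 4 bytes per character (ignoring any BOM)
        if encoding == "utf-8":
            # Variable width: 1 to 4 bytes per character, so only
            # lower/upper bounds are possible without reading the file.
            return (size // 4, size)
        raise ValueError(f"don't know the width of {encoding!r}")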