Calculating character count in a plain text file based on file encoding and file size

Question

Background info: I am teaching myself concurrent programming in python, to do this I am implementing a version of grep that splits the task of searching into work units to be executed on separate cores.

I noticed in this question , that grep is able to search quickly due to some optimisations, a key optimisation being that it avoids reading every byte in input files. An example of this is that the input is read into one buffer rather than being split up based on where newlines are found.

I would like to try out splitting large input files into smaller work units but without reading each byte to find new lines or anything similar to determine split points. My plan is to split the input in half (the splits simply being offsets), then split those halves into halves continuing until they are of manageable (possibly predetermined) sizes - naturally to do this you need to know the size of your input.

The Question: is it possible to calculate or estimate the number of characters in a plain text file, if the size of the file is known and the encoding is also known?

Theoretically - hard (i.e. as you can't see if de-normalized characters are in the file - also you may be able to find nearest word boundary if you know that language of your choice use some simple boundaries), practically - yes in many cases (as long as you pick fixed width encoding like UTF32 or have ASCII only text). — Alexei Levenkov, Mar 01 '16 at 02:22
Hey thanks for this, I will do some research into it - haven't reached the stage where I was to optimise to this level but was just wondering if it was even worth considering — The_Neo, Mar 02 '16 at 17:01
For exercises in learning parallel programming - reasonable. For real tool - likely pointless as IO is usually bottleneck itself and actually doing something simple like grep is way faster than reading from the file (try both approaches and measure - which is good learning experience as well). — Alexei Levenkov, Mar 02 '16 at 17:14

Calculating character count in a plain text file based on file encoding and file size

0 Answers0