
I have a few hundred files, each between tens of MB and a few GB in size, and I'd like to estimate the number of lines in each (an exact count is not needed). Each line is very regular, for example something like 4 long ints and 5 double floats.

I tried to find the average size of the first AVE_OVER lines in a file, then use that to estimate the total number of lines:

import os
import sys

# files is a list of file paths
nums = sum(1 for line in open(files[0]))
print "Number of lines = ", nums

AVE_OVER = 10
lineSize = 0.0
count = 0
for line in open(files[0]):
    lineSize += sys.getsizeof(line)
    count += 1
    if( count >= AVE_OVER ): break

lineSize /= count
fileSize = os.path.getsize(files[0])
numLines = fileSize/lineSize
print "Estimated number of lines = ", numLines

The estimate was way off:

> Number of lines =  505235
> Estimated number of lines =  324604.165863

So I compared the total size of all lines in the file, as measured by sys.getsizeof, against the actual file size:

fileSize = os.path.getsize(files[0])
totalLineSize = 0.0
for line in open(files[0]):
    totalLineSize += sys.getsizeof(line)

print "File size = %.3e" % (fileSize)
print "Total Line Size = %.3e" % (totalLineSize)

But again these are discrepant!

> File size = 3.366e+07
> Total Line Size = 5.236e+07

Why is the sum of the sizes of the individual lines so much larger than the actual file size? How can I correct for this?


Edit: the algorithm I ended up with (ver 2.0); thanks to @J.F.Sebastian:

import os
import numpy as np

def estimateLines(files):
    """ Estimate the number of lines in the given file(s) """

    if( not np.iterable(files) ): files = [files]
    LEARN_SIZE = 8192

    # Get total size of all files
    numLines = sum( os.path.getsize(fil) for fil in files )

    with open(files[0], 'rb') as file:
        buf = file.read(LEARN_SIZE)
        numLines /= (len(buf) // buf.count(b'\n'))

    return numLines
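
For example, a typical call might look like this (the file pattern here is hypothetical):

import glob

files = sorted(glob.glob("data_*.dat"))  # hypothetical list of input files
print "Estimated total lines = ", estimateLines(files)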
DilithiumMatrix
  • @OlehPrypin " \ Docstring: getsizeof(object, default) -> int \ Return the size of object in bytes." Yeah, God, I'm so dumb for not intuiting that this wouldn't give me the size of the object I was looking at... How could I be so stupid?! Thanks for the downvote. – DilithiumMatrix Dec 24 '14 at 22:55
  • `line.count(b'\n')` is `1` (or `0` if there is no newline at the end of the file). Don't use `for line in file` together with `line.count(b'\n')`: it is useless. Use one or the other. The latter is faster. `def estimateLines(filename): return os.path.getsize(filename) // line_size_hint(filename)` – jfs Dec 25 '14 at 05:13
  • @J.F.Sebastian thanks for the feedback. I was trying to implement an average over the first few lines instead --- but now I understand yours already does that (well, up to 8192 bytes). – DilithiumMatrix Dec 25 '14 at 05:19
  • If you want to avoid estimating the line size for each file and instead use the line size hint computed from the first file, then to estimate the number of lines in several files you could use `numLines = sum(map(os.path.getsize, files)) // line_size_hint(files[0])`. – jfs Dec 25 '14 at 05:22
  • *1)* Remove `if not np.iterable(files): files = [files]`. It does nothing because `str`/`unicode` instances are iterable (the types that are accepted by `open()` in Python 2). Do you expect a (buffer) type that is not iterable but that is accepted by `open()`? *2)* Use `//=` for compatibility with Python 3. *3)* Why don't you want to refactor the code fragment into a `line_size_hint()` function (the function call overhead should be negligible compared to all the I/O)? *4)* Don't put the *answer* (the solution) into the question; post it as an answer instead. – jfs Dec 25 '14 at 06:19

2 Answers


To estimate the number of lines in a file:

import os

def line_size_hint(filename, learn_size=1 << 13):
    with open(filename, 'rb') as file:
        buf = file.read(learn_size)
        return len(buf) // buf.count(b'\n')

number_of_lines_approx = os.path.getsize(filename) // line_size_hint(filename)
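
Since the question involves a few hundred files, a sketch for applying this across all of them (assuming `files` is a list of paths; not part of the original answer) could be:

import os

# Estimate each file with its own line-size hint and sum the results.
total_estimate = sum(os.path.getsize(f) // line_size_hint(f) for f in files)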

To find the exact number of lines, you could use a wc-l.py script:

#!/usr/bin/env python
import sys
from functools import partial

print(sum(chunk.count('\n') for chunk in iter(partial(sys.stdin.read, 1 << 15), '')))
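
The script counts newlines in chunks read from standard input, so it would be run as `python wc-l.py < data.dat`. To do the same from within Python for a given path, a rough adaptation (mine, not part of the original answer) could be:

from functools import partial

def count_lines_exact(filename, chunk_size=1 << 15):
    # Exact count: read the file in binary chunks and count newline bytes.
    with open(filename, 'rb') as file:
        return sum(chunk.count(b'\n')
                   for chunk in iter(partial(file.read, chunk_size), b''))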
jfs

sys.getsizeof is the sole cause of the problems here. It reports the in-memory size of the Python object, including the interpreter's per-object overhead, not the number of bytes the line occupies in the file, so summing it over every line overshoots the real file size. Its results are implementation-dependent and it shouldn't be used for this at all, except in very rare cases.
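
A quick illustration of the overhead (the exact numbers depend on the Python version and build):

import sys

line = "1 2 3 4 0.1 0.2 0.3 0.4 0.5\n"
print(len(line))            # characters in the line (its size on disk for ASCII text)
print(sys.getsizeof(line))  # in-memory object size: larger by a fixed per-object overhead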

Just open the file in binary mode and get the actual length of each line with len.
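
For example, the question's averaging approach works once sys.getsizeof is replaced with len on binary lines (a sketch along those lines, with a hypothetical function name):

import os

def estimate_lines(filename, ave_over=10):
    # Average the on-disk length of the first ave_over lines,
    # then divide the file size by that average.
    # Assumes the file is non-empty.
    line_size = 0
    with open(filename, 'rb') as f:
        for count, line in enumerate(f, 1):
            line_size += len(line)
            if count >= ave_over:
                break
    return os.path.getsize(filename) // (line_size // count)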

Oleh Prypin