I found this function when looking up how to count lines in a file, but I have no idea how it works.

def _count_generator(reader):
    b = reader(1024 * 1024)
    while b:
        yield b
        b = reader(1024 * 1024)

with open('test.txt', 'rb') as fp:
    c_generator = _count_generator(fp.raw.read)
    # count each new line
    count = sum(buffer.count(b'\n') for buffer in c_generator)
    print('total lines', count + 1)

I understand that it's reading the file as a bytes object, but I don't understand what reader(1024 * 1024) does or how exactly the whole thing works.

Any help is appreciated. Thanks.

Keknokz
  • This looks like overkill – Mad Physicist Feb 17 '23 at 03:41
  • This is not idiomatic Python. This looks like it was written by a die-hard C programmer who refuses to learn how to use Python. – Paul M. Feb 17 '23 at 03:45
  • A die-hard C programmer wouldn't have written a generator. This looks more like it was written by someone who has to be worried about **extremely** long lines causing memory errors with a naive `sum(1 for line in f)`. – user2357112 Feb 17 '23 at 03:49
  • Yes, that is the purpose of this function: to read long (GB in size) text files and show how many lines there are. – Keknokz Feb 17 '23 at 03:51
  • `reader(1024 * 1024)` is `fp.raw.read(1024 * 1024)`, which just reads up to that many bytes of the file at a time. This looks like it's intended for handling super long lines. A Pythonic version would use `len(f.readlines())`, or `f.readline()` in a loop, if you aren't worried about file size. – Steven Summers Feb 17 '23 at 03:52
  • The code should not blindly apply `+ 1` to the line count when the last line ends with a newline character. – blhsing Feb 17 '23 at 04:27

2 Answers

open() returns a file object. Since the file is opened with rb (read-binary), it returns an io.BufferedReader. The underlying raw stream is available via the .raw attribute, which is a RawIOBase; its read method, RawIOBase.read, is passed to _count_generator.

Since _count_generator is a generator function, calling it returns an iterable. Each time the caller asks for the next item, it reads up to 1 MB of data from the file and yields that data back. When the file is exhausted, reader() returns an empty bytes object, which is falsy, so the loop stops.
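To make the chunking visible, here is a minimal sketch of the same loop run against an in-memory stream with a tiny chunk size. The name read_chunks, the sample data and the 4-byte chunk size are illustrative assumptions, not part of the original code.

```python
import io

def read_chunks(reader, chunk_size):
    """Yield chunks from reader until it returns an empty (falsy) bytes object."""
    chunk = reader(chunk_size)
    while chunk:
        yield chunk
        chunk = reader(chunk_size)

# 14 bytes of sample data, read 4 bytes at a time (hypothetical values).
stream = io.BytesIO(b"one\ntwo\nthree\n")
chunks = list(read_chunks(stream.read, 4))
print(chunks)  # [b'one\n', b'two\n', b'thre', b'e\n']
print(sum(chunk.count(b"\n") for chunk in chunks))  # 3
```

Note that a chunk boundary can fall in the middle of a line (b'thre' above). That does not matter here, because only the newline bytes are counted.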

The caller takes each chunk, counts the newlines in it, and adds the counts together with sum, over and over again, until the file is exhausted.

tl;dr You are reading a file 1 MB at a time and summing its newlines. Why? Most likely because the file is too large to load into memory all at once.
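For reference, a rough sketch of a self-contained variant follows. It assumes Python 3.8+ (for the := operator), reads through the ordinary buffered file object instead of .raw, and only counts an extra line when the file does not end with a newline, which is the issue raised in the comments on the question. The function name count_lines and the sample file are made up for illustration.

```python
import os
import tempfile

def count_lines(path, chunk_size=1024 * 1024):
    """Count lines by scanning fixed-size binary chunks for newline bytes."""
    newlines = 0
    last_byte = b""
    with open(path, "rb") as fp:
        while chunk := fp.read(chunk_size):
            newlines += chunk.count(b"\n")
            last_byte = chunk[-1:]
    # A final line without a trailing newline still counts as a line.
    if last_byte and last_byte != b"\n":
        newlines += 1
    return newlines

# Usage on a small throwaway file (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.txt")
    with open(sample_path, "wb") as out:
        out.write(b"first\nsecond\nthird")
    print("total lines", count_lines(sample_path, chunk_size=4))  # total lines 3
```

Only one chunk is held in memory at a time, so the memory use stays bounded by chunk_size regardless of the file size or line length.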

felipe
  • `raw` is not a function and does not return a buffer. It's an attribute that refers to a different class wrapping the same underlying OS file object. Your main point still holds though. Might be worth noting that in most cases, it will read not 1MB but the disk block size. – Mad Physicist Feb 17 '23 at 03:56
  • Thank you for including links to documentation as well so I can look further into it. Cheers for the help. – Keknokz Feb 17 '23 at 03:57

Let's start with the argument to the function. fp.raw.read is the read method of the raw reader of the binary file fp. The read method accepts an integer that tells it how many bytes to read. It returns an empty bytes object on EOF.

The function itself is a generator. It lazily calls read to get up to 1MB of data at a time. The chunks are not read until requested by the generator expression in sum, which counts newlines. Raw read with a positive integer argument will only make one call to the underlying OS, so 1MB is just a hint in this case: most of the time it will read one disk block, usually around 4KB or so.
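As an aside, the "call read until it returns empty bytes" loop can also be written with the two-argument form of the built-in iter, which calls a function repeatedly until it returns a sentinel value. The sketch below is one possible equivalent, not the original author's code; count_newlines_iter and the sample data are illustrative names.

```python
import io
from functools import partial

def count_newlines_iter(fp, chunk_size=1024 * 1024):
    # iter(callable, sentinel) calls fp.read(chunk_size) lazily, one chunk
    # per step, and stops as soon as a call returns b"" (EOF).
    chunks = iter(partial(fp.read, chunk_size), b"")
    return sum(chunk.count(b"\n") for chunk in chunks)

sample = io.BytesIO(b"alpha\nbeta\ngamma\n")
total = count_newlines_iter(sample, chunk_size=5)
print(total)  # 3
```

This is lazy in the same way as the generator: nothing is read until sum asks for the next chunk.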

This program has two immediately apparent flaws if you take the time to read the documentation.

  1. raw is not guaranteed to exist in every implementation of Python:

    This is not part of the BufferedIOBase API and may not exist on some implementations.

  2. read in non-blocking mode can return None when no data is available but EOF has not been reached. Only an empty bytes object indicates EOF, so the while loop should be while b != b'':.
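A hedged sketch of how both points could be handled together is below. It falls back to the file object's own read when there is no raw attribute, and it skips None results instead of treating them as data, since calling count on None would fail. The names iter_chunks and count_newlines are hypothetical, and the retry on None is a simplification: a real non-blocking program would wait for the stream to become readable rather than spin.

```python
import io

def iter_chunks(reader, chunk_size=1024 * 1024):
    """Yield non-empty chunks until reader returns b'' (EOF)."""
    while True:
        chunk = reader(chunk_size)
        if chunk is None:
            # Non-blocking read with no data ready yet; this is not EOF.
            # Simplification: retry immediately instead of waiting.
            continue
        if chunk == b"":
            return
        yield chunk

def count_newlines(fp, chunk_size=1024 * 1024):
    # Use the raw stream when the implementation provides one,
    # otherwise read through the file object itself.
    reader = getattr(fp, "raw", fp).read
    return sum(chunk.count(b"\n") for chunk in iter_chunks(reader, chunk_size))

print(count_newlines(io.BytesIO(b"x\ny\n")))  # 2
```

For an ordinary blocking file on disk the None branch is never taken, so the behavior matches the original loop.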

Mad Physicist